Whether you lack years of experience tuning algorithms for different kinds of ML tasks, or you have that experience and just want a quick baseline across several models and parameters, AutoML can help. In this notebook we use the automated machine learning capabilities in Azure Machine Learning to find the model that performs best for our dataset, our problem, and the value we want to predict.

More about the AutoML capabilities in Azure Machine Learning here: https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml

To run the contents of the cell, press ctrl+enter :)


Import the libraries:

In [18]:
import logging
import os
import json

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core.dataset import Dataset
from azureml.train.automl import AutoMLConfig

from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

Connect to the Azure Machine Learning workspace and the SecurityBugClassification experiment that was created earlier:

In [19]:
ws = Workspace.from_config()

# automl_folder = './automl_output/'
automl_folder = os.path.join(os.getcwd(), 'automl_output')
exp_name = 'SecurityBugClassification'

experiment = Experiment(ws, exp_name)

The dataset has been uploaded and registered in the workspace, so we just need to get it from there:

In [20]:
#import dataset from ws:
dataset = Dataset.get_by_name(ws, name='SecBugDatasetLabelL2')

X = dataset.keep_columns(columns=['summary'])
y = dataset.keep_columns(columns=['Label L2'])

We are again using the labels from one of the labelers, found in the column Label L2:

In [21]:
print(X.take(5).to_pandas_dataframe())
print(y.take(5).to_pandas_dataframe())

                                             summary
0  Drag and drop of learning design zip file not ...
1        Error Launching Compendium LD after install
2  Multiple column Arrange for map of unlinked no...
3  Facility to assign node icon sets on a per map...
4                                      Spell Checker
   Label L2
0         0
1         0
2         0
3         0
4         0


In [22]:
# Create the compute resource that will run AutoML.
# If a cluster by that name already exists, use it.

from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
import os


# choose a name for your cluster
compute_name = os.environ.get('AML_COMPUTE_CLUSTER_NAME', 'cpu-cluster4n')

# The cluster scales between 0 and 4 nodes. With a minimum of 0 nodes
# it scales down automatically when it is not in use.
# Environment variables are strings, so convert the values to int.
compute_min_nodes = int(os.environ.get('AML_COMPUTE_CLUSTER_MIN_NODES', 0))
compute_max_nodes = int(os.environ.get('AML_COMPUTE_CLUSTER_MAX_NODES', 4))

# This example uses CPU VM. For using GPU VM, set SKU to STANDARD_NC6
vm_size = os.environ.get('AML_COMPUTE_CLUSTER_SKU', 'STANDARD_D2_V2')


if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('found compute target. just use it. ' + compute_name)
else:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_size,
                                                                min_nodes=compute_min_nodes, 
                                                                max_nodes=compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())

found compute target. just use it. cpu-cluster4n
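Settings read from environment variables always arrive as strings, while node counts are numbers. The following is a minimal sketch of a helper that converts and validates such values; `env_int` and the `DEMO_NODES` variable are hypothetical names used only for illustration, not part of the Azure ML SDK.

```python
import os


def env_int(name, default):
    """Read an integer setting from the environment, falling back to a default."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError:
        # Fail early with a message that names the offending variable.
        raise ValueError(f"{name} must be an integer, got {raw!r}")


# Hypothetical variable, set here only to demonstrate the conversion.
os.environ["DEMO_NODES"] = "3"
print(env_int("DEMO_NODES", 0))          # value comes from the environment
print(env_int("DEMO_NODES_UNSET", 4))    # falls back to the default
```

A helper like this could wrap the lookups for the minimum and maximum node counts so that a malformed value is reported before the cluster is provisioned.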


We'll configure a Conda environment for the AutoML job to use during training:

In [23]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
import pkg_resources

conda_run_config = RunConfiguration(framework="python")

conda_run_config.target = compute_target
conda_run_config.environment.docker.enabled = True

cd = CondaDependencies.create(conda_packages=['numpy','scikit-learn','py-xgboost<=0.80'],
                              pip_packages=['azureml-train-automl'])

conda_run_config.environment.python.conda_dependencies = cd

The AutoML configuration defines the settings for the run, such as how many iterations to try (each iteration tests one combination of algorithm and parameters) and which metric to optimize. We also tell it to preprocess the data, which takes care of vectorizing the text, and to use 5-fold cross-validation since we don't have much data.

In [24]:
automl_settings = {
    "iteration_timeout_minutes": 5,
    "iterations": 20,
    "n_cross_validations": 5,
    "primary_metric": 'AUC_weighted',
    "preprocess": True,
    "max_concurrent_iterations": 3,
    "enable_early_stopping": True,
    "verbosity": logging.INFO
}

automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             path = automl_folder,
                             run_configuration=conda_run_config,
                             X = X,
                             y = y,
                             **automl_settings)

remote_run = experiment.submit(automl_config, show_output = False)



Running on remote or ADB.
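To make the `n_cross_validations` setting more concrete, here is a minimal sketch of how k-fold cross-validation partitions a dataset: each sample is used for validation exactly once and for training in the other folds. This is a plain illustration, not AutoML's own splitting logic (which may, for example, shuffle or stratify the data); `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Ten samples and five folds: every fold validates on two samples.
for fold, (train, validation) in enumerate(k_fold_indices(10, 5)):
    print(fold, train, validation)
```

With little data, averaging the metric over the folds gives a steadier estimate than a single train/validation split, which is the motivation given above for using cross-validation.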


We can use a widget to watch the progress of the run in the notebook:

In [25]:
from azureml.widgets import RunDetails
RunDetails(remote_run).show()


In [28]:
remote_run.wait_for_completion(show_output = False)

{'runId': 'AutoML_caffbf1d-0760-45fd-81f3-2eff0be8255b',
 'target': 'cpu-cluster4n',
 'status': 'Completed',
 'startTimeUtc': '2020-05-20T08:19:40.103043Z',
 'endTimeUtc': '2020-05-20T08:41:01.979531Z',
 'properties': {'num_iterations': '20',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'AUC_weighted',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '5',
  'target': 'cpu-cluster4n',
  'DataPrepJsonString': '{\\"X\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"f235350e-ac5c-444f-966b-b20b13529bd6\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"UI/05-09-2020_012911_UTC/secBugDataLabelColL2.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"cdwaisec\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"7f150ec6-cc4b-4575-b242-1d8de759c3ab\\\\\\", \

Once the AutoML job has finished, we have a set of child runs, each with a trained model that includes its preprocessing steps. We can inspect these in the portal, and we most likely want to deploy the one with the best results:

In [29]:
best_run, fitted_model = remote_run.get_output()
print("Run:", best_run)
print("Model:", fitted_model)

Run: Run(Experiment: SecurityBugClassification,
Id: AutoML_caffbf1d-0760-45fd-81f3-2eff0be8255b_18,
Type: azureml.scriptrun,
Status: Completed)
Model: Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
        feature_sweeping_config=None, feature_sweeping_timeout=None,
        featurization_config=None, force_text_dnn=None,
        is_cross_validation=None, is_onnx_compatible=None, logger=None,
        obser...6666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.6666666666666666]))])
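The best run is the one that scores highest on the primary metric, `AUC_weighted` in this configuration. As a rough illustration of what an AUC value measures for a binary problem, the sketch below computes it as the fraction of (positive, negative) pairs that the model ranks correctly. The labels and scores are made-up toy values, `binary_auc` is a hypothetical helper, and this is not how AutoML computes the metric internally; as far as I understand, `AUC_weighted` additionally averages per-class AUC values weighted by class frequency.

```python
def binary_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: three of the four positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(binary_auc(labels, scores))  # 0.75
```

A value of 1.0 means every positive example is scored above every negative one, while 0.5 corresponds to random ranking.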


Let's do a quick check:

In [30]:
test = pd.DataFrame(['The formatting of the form is skewed','password and username can be leaked', 'administrators have way too many permissions'], columns = ['summary'])
fitted_model.predict(test)

array([0, 1, 1], dtype=int64)

Looks good. We want to register this model and deploy it to Azure Container Instances (ACI), exposing a REST endpoint so that other applications can call it to classify security bugs based on their title.

In [31]:
description = 'AutoML Model for Security Bug Classification'
tags = {"dataset": "SecBugDatasetLabelL2"}
model = remote_run.register_model(description = description, tags = tags)

print(model)

Model(workspace=Workspace.create(name='cdwaisecws', subscription_id='7f150ec6-cc4b-4575-b242-1d8de759c3ab', resource_group='cdwaisec'), name=AutoMLcaffbf1d018, id=AutoMLcaffbf1d018:1, version=1, tags={'dataset': 'SecBugDatasetLabelL2'}, properties={})


As before, for deployment all we need is:

* A scoring script to show how to use the model
* An environment file to show what packages need to be installed
* A configuration file to build the ACI
* The model we trained before

Important: replace the value of the model_name parameter below with the model name from your own run, shown in the output of the cell above.
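The registered model's id has the form `name:version`, and the scoring script needs only the name part. A minimal sketch of extracting it from the id string printed above; `model_name_from_id` is a hypothetical helper, and in the notebook the same value is shown as the `name` field of the printed `Model`.

```python
def model_name_from_id(model_id):
    """Return the part of a 'name:version' model id before the colon."""
    name, _, _version = model_id.partition(":")
    return name


# The id below is copied from the output of the registration cell.
print(model_name_from_id("AutoMLcaffbf1d018:1"))  # AutoMLcaffbf1d018
```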

In [46]:
%%writefile ./deploy-model/automlscore.py
import os
import pickle
import json
import pandas as pd
import azureml.train.automl
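try:
    import joblib  # standalone joblib, used by newer scikit-learn releases
except ImportError:
    from sklearn.externals import joblib  # vendored copy in older scikit-learn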
from azureml.core.model import Model

def init():
    global model
    # model_name is the "id" from the previous cell's output, up to (not including) the colon
    model_path = Model.get_model_path(model_name = "AutoMLcaffbf1d018")
    model = joblib.load(model_path)

def run(rawdata):
    try:
        data = json.loads(rawdata)['text']
        result = model.predict(pd.DataFrame(data, columns = ['summary']))
    except Exception as e:
        result = str(e)
        return json.dumps({"error": result})
    return json.dumps({"result": result.tolist()})

Overwriting ./deploy-model/automlscore.py
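Before deploying, it can help to exercise the request and response contract of the scoring script locally. The sketch below mirrors the structure of `run()` but swaps in a hypothetical `StubModel` with a made-up keyword rule, and passes plain dictionaries instead of a pandas DataFrame so that it has no dependencies. It only illustrates the JSON shapes involved, not the behavior of the real fitted model.

```python
import json


class StubModel:
    """Hypothetical stand-in for the fitted AutoML pipeline."""

    def predict(self, rows):
        # Purely illustrative rule: flag titles that mention "password".
        return [1 if "password" in row["summary"].lower() else 0 for row in rows]


model = StubModel()


def run(rawdata):
    """Same contract as the scoring script: JSON string in, JSON string out."""
    try:
        titles = json.loads(rawdata)["text"]
        rows = [{"summary": title} for title in titles]
        result = model.predict(rows)
    except Exception as e:
        return json.dumps({"error": str(e)})
    return json.dumps({"result": list(result)})


# A well-formed request returns a "result" list, one entry per title.
print(run(json.dumps({"text": ["password and username can be leaked"]})))
# A request without the "text" key returns an "error" message instead of raising.
print(run("{}"))
```

Returning the error as JSON rather than letting the exception propagate keeps the response shape predictable for the calling application.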


Next, create an environment file:

In [36]:
conda_env_file_name = 'automl-secbugclassification-env.yml'
cd.save_to_file('./deploy-model/', conda_env_file_name)

'automl-secbugclassification-env.yml'

Create a deployment configuration:

In [34]:
from azureml.core.webservice import AciWebservice

aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, 
                                               memory_gb=1, 
                                               tags={"data": "SecBugDataset",  "method" : "automl"}, 
                                               description='Predict Security Bugs with automl model')

Configure the image and deploy:

In [47]:
%%time
from azureml.core.webservice import Webservice
from azureml.core.model import InferenceConfig, Model
from azureml.core.environment import Environment

scorefile = os.path.join(os.getcwd(), 'deploy-model','automlscore.py')
myenvfile = os.path.join(os.getcwd(), 'deploy-model','automl-secbugclassification-env.yml')

myenv = Environment.from_conda_specification(name="myenv", file_path=myenvfile)
inference_config = InferenceConfig(entry_script=scorefile, environment=myenv)

service = Model.deploy(workspace=ws, 
                       name='secbug-automl-svc-7', 
                       models=[model], 
                       inference_config=inference_config, 
                       deployment_config=aciconfig)

service.wait_for_deployment(show_output=True)



Running................................................
Succeeded
ACI service creation operation finished, operation "Succeeded"
Wall time: 4min 27s


Get the scoring web service's HTTP endpoint, which accepts REST client calls. This endpoint can be shared with anyone who wants to test the web service or integrate it into an application:

In [50]:
print(service.scoring_uri)

http://28e022e1-54ae-454a-8540-d9ce4f32b534.westeurope.azurecontainer.io/score


Now we can test the deployed model:

In [49]:
import requests
import json

headers = {'Content-Type':'application/json'}
data = {"text": ['The formatting of the form is skewed','password and username can be leaked', 'administrators have way too many permissions']}

test_samples = json.dumps(data)
print(test_samples)

resp = requests.post(service.scoring_uri, json=data, headers=headers)
print("Prediction Results:", resp.json())

{"text": ["The formatting of the form is skewed", "password and username can be leaked", "administrators have way too many permissions"]}
Prediction Results: {"result": [0, 1, 1]}
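Because the scoring script returns a string built with `json.dumps`, the service may encode that string as JSON a second time, in which case `resp.json()` yields a string rather than a dictionary (the double-quoted output above suggests this is what happens here). A minimal sketch of a tolerant decoder follows; `decode_scoring_response` is a hypothetical helper, and the simulated response bodies are constructed by hand rather than fetched from the service.

```python
import json


def decode_scoring_response(payload):
    """Return the response as a dict, decoding a second time if it is still a JSON string."""
    if isinstance(payload, (bytes, bytearray)):
        payload = payload.decode("utf-8")
    decoded = json.loads(payload) if isinstance(payload, str) else payload
    if isinstance(decoded, str):
        # The body was a JSON string that itself contains JSON.
        decoded = json.loads(decoded)
    return decoded


# Simulated bodies: one double-encoded, one encoded once.
double_encoded = json.dumps(json.dumps({"result": [0, 1, 1]}))
single_encoded = json.dumps({"result": [0, 1, 1]})
print(decode_scoring_response(double_encoded)["result"])
print(decode_scoring_response(single_encoded)["result"])
```

In the notebook this could be applied as `decode_scoring_response(resp.text)` so that the predictions are available as a Python list regardless of how the body was encoded.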


In [51]:
print("Bug Titles to score:", test_samples)

resp = requests.post(service.scoring_uri, json=data, headers=headers)
print("Prediction Results:", resp.json())

Bug Titles to score: {"text": ["The formatting of the form is skewed", "password and username can be leaked", "administrators have way too many permissions"]}
Prediction Results: {"result": [0, 1, 1]}
