Introtext

To run the contents of a cell, press Ctrl+Enter :)


The first thing we need to do is connect to the Azure Machine Learning Workspace:

In [None]:
from azureml.core import Workspace, Dataset

ws=Workspace.from_config()

The dataset has already been uploaded and registered in the workspace, so we just need to get it from there:

In [None]:
#import dataset from ws:
dataset = Dataset.get_by_name(ws, name='SecBugDataset')
df = dataset.to_pandas_dataframe()

# this is the dataset we want to use:
df.head()

We'll use the result from one of the labelers, found in the column L2. 

In [None]:
# these are the bugs labeled as security bugs by labeler 2

sb = df[df['L2'] == 'Integrity/Security']
print(sb)

In [None]:
# how many bugs are labeled as security bugs? 

print(len(sb))

First, let's run locally what we later want to deploy and run in the cloud:


First, there are a couple of libraries we need to import: Azure Machine Learning (AML) libraries for working with datasets and for handling training runs that are logged to our experiment, and libraries from the machine learning framework scikit-learn, which provides the classifier algorithm we train on our dataset to produce a model.


In [None]:
import os
import math
import string
import numpy as np

from azureml.core import Dataset, Run
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score 
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.externals import joblib

We could experiment with different labelers, but for now we will use the results from labeler 2, so we create a new column "Label" that is 1 where the L2 column says 'Integrity/Security' and 0 otherwise:


In [None]:
df['Label'] = [1 if x =='Integrity/Security' else 0 for x in df['L2']]

The field we want to use to predict whether a bug is a security bug or not is
the "summary" field. It is a short text, and the text must be translated into a 
representation the machine learning algorithm understands. For this we use the tf-idf vectorization algorithm:


In [None]:
#do the vectorization - tf-idf

vectorizer = TfidfVectorizer(min_df=2)
tfidf = vectorizer.fit_transform(df['summary'])
tfidf = tfidf.toarray()

The vectorizer has now built a vector for each summary text. 
Each entry is based on the number of times a word occurs in that text, weighted down by
how many of the texts the word occurs in overall. 
The result is a matrix with one row per text and one column per word:

In [None]:
print(tfidf.shape)
words = vectorizer.get_feature_names()
print(len(words))
print(words[10:20])
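
To make the weighting more concrete, here is a minimal pure-Python sketch of tf-idf on three made-up bug summaries. The function name `tfidf_rows` and the example texts are our own, and the whitespace tokenizer is a simplification: `TfidfVectorizer` tokenizes differently and also applies `min_df`. The smoothed idf formula and the L2 normalization follow what we understand to be the scikit-learn defaults, so treat this as an illustration rather than a reimplementation.

```python
import math
from collections import Counter


def tfidf_rows(texts):
    """Return (vocabulary, rows) with one L2-normalized tf-idf row per text."""
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_texts = len(texts)

    # document frequency: in how many texts does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    # smoothed inverse document frequency: rare words get a higher weight
    idf = {w: math.log((1 + n_texts) / (1 + doc_freq[w])) + 1 for w in vocabulary}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocabulary, rows


# made-up example texts, not taken from the dataset
texts = ["crash on login", "password leak on login", "crash on startup"]
vocabulary, rows = tfidf_rows(texts)

print(vocabulary)
print(len(rows), len(rows[0]))  # texts along the rows, words along the columns
```

The word "on" occurs in every text, so it ends up with a lower weight in each row than a word such as "crash" that occurs in only some of them.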

Let's create a column in our dataframe with the vectors representing the summary text:

In [None]:
df['summary_vec'] = list(tfidf)

print(df.head())

What we want to do now is take X (all the summary texts 
in their vector representation) and y (the column we want to 
predict, the Label column) and split them into a training set and a test set.
We'll use the first portion to train a classifier, and the 
second portion to test the classifier afterwards, to see how well it performs:

In [None]:
# split the dataset into test and train 

x = df['summary_vec'].tolist()
y = df['Label'].tolist()
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2,stratify=y, random_state=66)
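
The `stratify=y` argument matters here because security bugs are a minority class: it keeps the share of each label roughly the same in both portions. The following is a small pure-Python sketch of that idea. The helper `stratified_split` and the made-up labels are our own, and this is not how scikit-learn implements it internally.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=66):
    """Return (train_idx, test_idx) so that each class keeps roughly its share."""
    rng = random.Random(seed)

    # group the row indices by label
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # shuffle within the class, then cut off the test fraction
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# made-up labels: 10 security bugs among 100 bugs
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)

# number of security bugs in the test set, and size of the test set
print(sum(labels[i] for i in test_idx), len(test_idx))
```

With a plain random split, a small test set could by chance end up with very few security bugs, which would make the metrics further down hard to trust.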

Now we want to create and train the model:

In [None]:
model = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr')
model.fit(X=X_train, y=y_train)

Let's do the predictions on the test set, data that the classifier wasn't trained on:

In [None]:
y_pred = model.predict(X=X_test)

After predicting, let's use some common performance measures to test how well our model performs:

In [None]:
auc_weighted = roc_auc_score(y_test, y_pred,average="weighted")
accuracy = accuracy_score(y_test, y_pred)

print(auc_weighted)
print(accuracy)
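
One thing to keep in mind: AUC measures how well the model ranks positives above negatives, so it is usually computed from predicted probabilities rather than from hard 0/1 predictions as we do above. With scikit-learn that would mean passing something like `model.predict_proba(X_test)[:, 1]` to `roc_auc_score`. The sketch below uses a small hand-written `roc_auc` helper and made-up numbers (both are our own, for illustration only) to show how thresholding can throw away ranking information.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    negatives = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# made-up labels and predicted probabilities
y_true = [0, 0, 1, 1]
probabilities = [0.10, 0.40, 0.45, 0.80]

# what predict() would give us with a 0.5 threshold
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

print(roc_auc(y_true, probabilities))  # 1.0
print(roc_auc(y_true, hard_labels))    # 0.75
```

The probabilities rank both positives above both negatives, but after thresholding one positive is tied with the negatives, so the AUC drops.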

We don't just want to run this locally. We would like to run it on a compute cluster in the cloud, track metrics on how the training performed, and make the results available to others in our team.

In [None]:
# create Experiment, my container for Runs

from azureml.core import Experiment

experiment = Experiment(workspace=ws, name="SecurityBugClassification")

In [None]:
# create compute resource that I will be using for training my classifier
# If a cluster by that name already exists, use it

from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
import os


# choose a name for your cluster
compute_name = os.environ.get('AML_COMPUTE_CLUSTER_NAME', 'cpu-cluster')

# I'll use a cluster of 0-1 nodes: scikit-learn does not scale out to
# several nodes, and a minimum of 0 lets the cluster shut down when not in use.
# Environment variables are strings, so convert the node counts to int.
compute_min_nodes = int(os.environ.get('AML_COMPUTE_CLUSTER_MIN_NODES', 0))
compute_max_nodes = int(os.environ.get('AML_COMPUTE_CLUSTER_MAX_NODES', 1))

# This example uses CPU VM. For using GPU VM, set SKU to STANDARD_NC6
vm_size = os.environ.get('AML_COMPUTE_CLUSTER_SKU', 'STANDARD_D2_V2')


if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('found compute target. just use it. ' + compute_name)
else:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_size,
                                                                min_nodes=compute_min_nodes, 
                                                                max_nodes=compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())

Now we want to submit a job to run on the remote training cluster we have created. To do that we need to:

* Create a training script
* Create an estimator object
* Submit the job 

We will put the files that will be copied to the remote cluster nodes for execution in the folder "train-dataset":

In [None]:
script_folder = os.path.join(os.getcwd(), 'train-dataset')
os.makedirs(script_folder, exist_ok=True)  # make sure the folder exists before we write to it

The directory must contain the training script you want to run. For better visibility into what the script does, we'll create the file here and write it to that folder:

In [None]:
%%writefile $script_folder/train.py

import os
import math
import string
import numpy as np

from azureml.core import Dataset, Run
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score 
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.externals import joblib


run = Run.get_context()

# get input dataset by name
dataset = run.input_datasets['SecBugDataset']

df = dataset.to_pandas_dataframe()


# create column used as target

df['Label'] = [1 if x =='Integrity/Security' else 0 for x in df['L2']]

# do the vectorization - tf-idf

vectorizer = TfidfVectorizer(min_df=2)
tfidf = vectorizer.fit_transform(df['summary'])
tfidf = tfidf.toarray()

# create our feature column
df['summary_vec'] = list(tfidf)

#dividing X,y into train and test data
X_train, X_test, y_train, y_test = train_test_split(df['summary_vec'].tolist(), df['Label'].tolist(), test_size=0.2, random_state=66)


# create our classifier & train it
model = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr')
model.fit(X=X_train, y=y_train)

# make predictions to see how well it does
y_pred = model.predict(X=X_test)

# measure it with two different metrics
auc_weighted = roc_auc_score(y_test, y_pred,average="weighted")
accuracy = accuracy_score(y_test, y_pred)

# log the metrics we want to track and measure on to the Run
run.log("AUC_Weighted", auc_weighted)
run.log("Accuracy", accuracy)

model_file_name = 'LogRegModel.pkl'


# The training script saves the model into a directory named 'outputs'. Anything written
# to this directory is automatically uploaded to the run record in the workspace.
os.makedirs('./outputs', exist_ok=True)
joblib.dump(value=model, filename=os.path.join('outputs', model_file_name))

In our training script we log important metrics to the current run and save the trained model to a directory called 'outputs', which is uploaded to the workspace and available through our run object when the training run has completed. Now we need to create an estimator object that contains the run configuration:

In [None]:
from azureml.train.sklearn import SKLearn

est = SKLearn(source_directory=script_folder, 
              entry_script='train.py', 
              inputs=[dataset.as_named_input('SecBugDataset')],
              #environment_definition=env,
              pip_packages=['azureml-dataprep[pandas]'],
              compute_target=compute_target) 

... and we submit this to the Experiment it belongs to:

In [None]:
run = experiment.submit(config=est)
run

In [None]:
run.wait_for_completion(show_output=True) 

These are the files stored with the run, including the contents of the outputs directory:

In [None]:
print(run.get_file_names())

Let's also register our model in the workspace so that we can retrieve it later for testing and deployment:

In [None]:
# register model 
model = run.register_model(model_name='LogRegModel.pkl', model_path='outputs/LogRegModel.pkl')
print(model.name, model.id, model.version, sep='\t')

Now, running this model just once, with no validation, no parameter tuning and no testing of other algorithms to see if they perform better, is not something we would do in reality. But for now, let's pretend we're satisfied and want others to be able to use our model in a real-world scenario. Then we need to deploy our model to a web service running in a container so that it can be consumed from other applications.

For that we need:
* A scoring script to show how to use the model
* An environment file to show what packages need to be installed
* A deployment configuration for the Azure Container Instances (ACI) container
* The model we trained before

Again, we will be creating the scoring script inline for visibility, called score.py. It is used by the web service call to show how to use the model.

You must include two required functions in the scoring script:
* The `init()` function, which typically loads the model into a global object. This function is run only once when the Docker container is started. 

* The `run(input_data)` function uses the model to predict a value based on the input data. Inputs and outputs to the run typically use JSON for serialization and de-serialization, but other formats are supported.
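
Before deploying, it can help to see the `init()` / `run()` contract in isolation. The sketch below uses a hypothetical `StubModel` in place of the trained classifier, so it runs without Azure or scikit-learn. Only the shape of the two functions and the JSON round trip are the point here; the stub's prediction rule is made up.

```python
import json


class StubModel:
    """Hypothetical stand-in for the trained classifier."""

    def predict(self, rows):
        # made-up rule: flag a row when its values sum to more than 1.0
        return [1 if sum(row) > 1.0 else 0 for row in rows]


model = None


def init():
    # runs once when the container starts: load the model into a global
    global model
    model = StubModel()


def run(raw_data):
    # runs for each request: parse the JSON, predict, return something JSON-serializable
    try:
        rows = json.loads(raw_data)["data"]
        return list(model.predict(rows))
    except Exception as error:
        return str(error)


init()
response = run(json.dumps({"data": [[0.2, 0.1], [0.9, 0.7]]}))
print(response)  # [0, 1]
```

Calling the two functions locally like this is a cheap way to catch shape or serialization mistakes before waiting for a container to build.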

In [None]:
deploy_folder = os.path.join(os.getcwd(), 'deploy-model')
os.makedirs(deploy_folder, exist_ok=True)  # make sure the folder exists before we write to it

In [None]:
%%writefile $deploy_folder/score.py
import os
import json
import numpy as np
from sklearn.externals import joblib


def init():
    global model
    # AZUREML_MODEL_DIR is an environment variable created during deployment.
    # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION)
    # For multiple models, it points to the folder containing all deployed models (./azureml-models)
    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'LogRegModel.pkl')
    # deserialize the model file back into a sklearn model
    model = joblib.load(model_path)


# "data" is a list of rows, so several tf-idf vectors can be scored in one call
def run(raw_data):
    try:
        data = json.loads(raw_data)['data']
        data = np.array(data)
        result = model.predict(data)

        # you can return any data type as long as it is JSON-serializable
        return result.tolist()
    except Exception as e:
        result = str(e)
        return result

Next, create an environment file, called myenv.yml, that specifies all of the script's package dependencies. This file is used to ensure that all of those dependencies are installed in the Docker image. This model needs `scikit-learn` and `azureml-defaults`.

In [None]:
from azureml.core.conda_dependencies import CondaDependencies 

myenv = CondaDependencies()
myenv.add_pip_package("scikit-learn==0.20.1")
myenv.add_pip_package("azureml-defaults")
myenv.add_pip_package('azureml-dataprep[pandas]')

with open("./deploy-model/myenv.yml","w") as f:
    f.write(myenv.serialize_to_string())

Create a deployment configuration and specify the number of CPU cores and gigabytes of RAM needed for your ACI container.

In [None]:
from azureml.core.webservice import AciWebservice

aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, 
                                               memory_gb=1, 
                                               tags={"data": "SecBugDataset",  "method" : "sklearn"}, 
                                               description='Predict Security Bugs with sklearn')

Configure the image and deploy. The following code goes through these steps:

1. Create environment object containing dependencies needed by the model using the environment file (`myenv.yml`)
1. Create inference configuration necessary to deploy the model as a web service using:
   * The scoring file (`score.py`)
   * environment object created in previous step
1. Deploy the model to the ACI container.
1. Get the web service HTTP endpoint.

In [None]:
%%time
from azureml.core.webservice import Webservice
from azureml.core.model import InferenceConfig, Model
from azureml.core.environment import Environment

scorefile = os.path.join(os.getcwd(), 'deploy-model','score.py')
myenvfile = os.path.join(os.getcwd(), 'deploy-model','myenv.yml')

myenv = Environment.from_conda_specification(name="myenv", file_path=myenvfile)
inference_config = InferenceConfig(entry_script=scorefile, environment=myenv)

service = Model.deploy(workspace=ws, 
                       name='secbug-sklearn-logreg-svc-4', 
                       models=[model], 
                       inference_config=inference_config, 
                       deployment_config=aciconfig)

service.wait_for_deployment(show_output=True)

Get the scoring web service's HTTP endpoint, which accepts REST client calls. This endpoint can be shared with anyone who wants to test the web service or integrate it into an application:

In [None]:
print(service.scoring_uri)
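
For someone who does not have the Azure ML SDK installed, calling the endpoint is an ordinary HTTP POST with a JSON body. Here is a minimal sketch using only the standard library. The helper `build_scoring_request` and the placeholder URI are our own, and we assume the service was deployed without authentication; a service with keys enabled would need an extra header.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, rows):
    """Build (but do not send) a POST request carrying the rows as JSON."""
    body = json.dumps({"data": rows}).encode("utf-8")
    return urllib.request.Request(
        scoring_uri,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URI: replace it with the value printed by service.scoring_uri.
# The row is a made-up, very short tf-idf vector.
request = build_scoring_request("http://example.invalid/score", [[0.0, 0.3, 0.0]])
print(request.get_method(), request.full_url)

# To actually call the service:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```

Note that each row must be a tf-idf vector with the same columns the model was trained on, not the raw summary text.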

Now we can test the deployed model:

In [None]:
# list the test labels with their index, to find a row worth scoring
for i, x in enumerate(y_test):
    print(x, ' ', i)

In [None]:
import json

# look up the service we deployed above by its name
service = ws.webservices['secbug-sklearn-logreg-svc-4']

# pick one row from the test set (index 62 here)
test_samples = json.dumps({"data": [X_test[62].tolist()]})
print(y_test[62])

#score on our service
service.run(input_data = test_samples)