# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [1]:
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.core.workspace import Workspace
from azureml.core.compute import ComputeTarget
from azureml.core.compute.amlcompute import AmlCompute
from azureml.exceptions import ComputeTargetException
from azureml.core.experiment import Experiment
from azureml.core.model import Model

from azureml.core import Environment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice
from azureml.core.webservice import Webservice

from azureml.train.automl.automlconfig import AutoMLConfig
from azureml.widgets.run_details import RunDetails

from azureml.automl.core.shared import constants

from pprint import pprint
import pandas as pd

import logging
import os
import joblib

import json
import requests

## Dataset

### Overview
TODO: In this markdown cell, give an overview of the dataset you are using. Also mention the task you will be performing.

Data: [Breast Cancer Prediction Dataset](https://www.kaggle.com/merishnasuwal/breast-cancer-prediction-dataset)

The task is binary classification: predict the presence (or absence) of breast cancer, recorded in the `diagnosis` column, from five physical measurements (`mean_radius`, `mean_texture`, `mean_perimeter`, `mean_area`, `mean_smoothness`).
A discussion of the data is available at https://www.kaggle.com/merishnasuwal/breast-cancer-prediction-dataset/discussion/66975#509394


TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

The dataset is external. It was downloaded manually as a CSV file and then uploaded to a publicly accessible GitHub repository:
'https://github.com/dntrply/nd00333-capstone/blob/master/dataset/Breast_cancer_data.csv'

In [2]:
ds = TabularDatasetFactory.from_delimited_files('https://github.com/dntrply/nd00333-capstone/raw/master/dataset/Breast_cancer_data.csv')
df = ds.to_pandas_dataframe()
df

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,diagnosis
0,17.99,10.38,122.80,1001.0,0.11840,0
1,20.57,17.77,132.90,1326.0,0.08474,0
2,19.69,21.25,130.00,1203.0,0.10960,0
3,11.42,20.38,77.58,386.1,0.14250,0
4,20.29,14.34,135.10,1297.0,0.10030,0
...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0
565,20.13,28.25,131.20,1261.0,0.09780,0
566,16.60,28.08,108.30,858.1,0.08455,0
567,20.60,29.33,140.10,1265.0,0.11780,0


In [3]:
df.describe()

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,diagnosis
count,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.627417
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.483918
min,6.981,9.71,43.79,143.5,0.05263,0.0
25%,11.7,16.17,75.17,420.3,0.08637,0.0
50%,13.37,18.84,86.24,551.1,0.09587,1.0
75%,15.78,21.8,104.1,782.7,0.1053,1.0
max,28.11,39.28,188.5,2501.0,0.1634,1.0


In [4]:
# Split the dataset so that a small fraction can be held out for prediction
train_ds, predict_ds = ds.random_split(percentage=0.99, seed=44)
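
As a rough illustration of what a percentage-based random split does, the sketch below sends each row to one of two parts with an independent, seeded random draw. This is an assumption made for illustration only, not the actual implementation of `random_split`, and `sketch_random_split` is a hypothetical helper. It shows why the two parts are only approximately 99% and 1% of the 569 rows.

```python
import random


def sketch_random_split(rows, percentage, seed):
    """Hypothetical helper: send each row to the first part with probability `percentage`."""
    rng = random.Random(seed)  # seeded generator, so the split is reproducible
    first, second = [], []
    for row in rows:
        # Each row is decided independently, so part sizes are only approximate
        if rng.random() < percentage:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(569))  # stand-in for the 569 records in the dataset
train_rows, predict_rows = sketch_random_split(rows, percentage=0.99, seed=44)
print(len(train_rows), len(predict_rows))
```

The exact counts depend on the random generator, so they will generally differ from the counts Azure ML produces for the same seed.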

In [5]:
train_df = train_ds.to_pandas_dataframe()
train_df

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,diagnosis
0,17.99,10.38,122.80,1001.0,0.11840,0
1,20.57,17.77,132.90,1326.0,0.08474,0
2,19.69,21.25,130.00,1203.0,0.10960,0
3,11.42,20.38,77.58,386.1,0.14250,0
4,20.29,14.34,135.10,1297.0,0.10030,0
...,...,...,...,...,...,...
560,21.56,22.39,142.00,1479.0,0.11100,0
561,20.13,28.25,131.20,1261.0,0.09780,0
562,16.60,28.08,108.30,858.1,0.08455,0
563,20.60,29.33,140.10,1265.0,0.11780,0


In [6]:
ws = Workspace.from_config()

# choose a name for experiment
experiment=Experiment(ws, 'experiment-capstone-automl')  # Experiment name in Azure ML

In [7]:
# Use the compute cluster if it already exists; otherwise create it.

# Try to access the compute cluster by name. If it exists, we get the compute object.
# If it does not exist, ComputeTargetException is raised and the cluster is created.
try:
    cc = ComputeTarget(workspace=ws, name='COMPUTE-AUTOML')
except ComputeTargetException:
    # Failed to obtain the compute cluster object
    # In all likelihood, a compute cluster of that name has not been created
    # Attempt to create the compute cluster
    # First set up the configuration

    # Specify the configuration of the compute cluster
    cc_cfg = AmlCompute.provisioning_configuration(vm_size='Standard_DS3_v2', min_nodes=1, max_nodes=4)
    cc = ComputeTarget.create(workspace=ws, name='COMPUTE-AUTOML', provisioning_configuration=cc_cfg)

# At this point we have the compute cluster object. Wait for the compute target to finish provisioning.
cc.wait_for_completion(show_output=True)

InProgress....
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded................
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## AutoML Configuration

TODO: Explain why you chose the automl settings and configuration you used below.

This project is a classification problem. More specifically, it is a binary classification problem, as the outcome is whether breast cancer is present or not.

AUC_weighted is an appropriate metric to target for binary classification.
See [Set up AutoML training with Python](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train).

It is generally recommended to enable early stopping as it is possible that after a while no further improvement in the model is feasible.

There is generally little to no benefit in using a large number of cross-validation folds. In this instance, it is set to 3.
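
To make the AUC metric mentioned above concrete, here is a minimal plain-Python sketch of AUC for a binary problem, computed as the fraction of (positive, negative) pairs that the scores rank correctly. It is an illustration, not the code AutoML runs; `binary_auc` and the sample scores are made up for the example. As far as I understand the AutoML documentation, `AUC_weighted` averages the per-class AUC weighted by class support; with two classes and complementary probabilities, both per-class values coincide, which the sketch checks.

```python
def binary_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]            # made-up ground truth
scores = [0.1, 0.4, 0.35, 0.8]   # made-up predicted probability of class 1

auc_class_1 = binary_auc(labels, scores)
# Treat class 0 as the "positive" class, scored with 1 - p
auc_class_0 = binary_auc([1 - y for y in labels], [1 - s for s in scores])
print(auc_class_1, auc_class_0)  # prints 0.75 0.75
```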

In [8]:
# TODO: Put your automl settings here

automl_settings = {
    "iterations" : 20,
    "experiment_timeout_minutes" : 30,
    "enable_early_stopping" : True,
    "iteration_timeout_minutes" : 5,
    "max_concurrent_iterations" : 5,
    "max_cores_per_iteration" : -1,
    "n_cross_validations" : 3,
    "primary_metric" : 'AUC_weighted',
    "verbosity" : logging.INFO,
}

# Provide the remainder of the settings/configuration.
# Note that no separate validation dataset is provided; cross-validation
# (n_cross_validations above) is used instead.


# TODO: Put your automl config here
automl_config = AutoMLConfig(
    compute_target = cc,
    task='classification',
    training_data=train_ds,
    label_column_name='diagnosis',
    featurization='auto',
    model_explainability=True,
    debug_log='capstone_automl.log',
    **automl_settings)

In [9]:
# TODO: Submit your experiment
automl_run = experiment.submit(automl_config)

Submitting remote run.


Experiment,Id,Type,Status,Details Page,Docs Page
experiment-capstone-automl,AutoML_0bf03af8-8d8d-4440-8bab-bd567c1496ea,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation


## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [10]:
RunDetails(automl_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

In [11]:
automl_run.wait_for_completion(show_output=True)

Experiment,Id,Type,Status,Details Page,Docs Page
experiment-capstone-automl,AutoML_0bf03af8-8d8d-4440-8bab-bd567c1496ea,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation



Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetCrossValidationSplit. Beginning model selection.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

****************************************************************************************************

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization

*************************************************************************************

{'runId': 'AutoML_0bf03af8-8d8d-4440-8bab-bd567c1496ea',
 'target': 'COMPUTE-AUTOML',
 'status': 'Completed',
 'startTimeUtc': '2021-11-03T23:19:19.336307Z',
 'endTimeUtc': '2021-11-03T23:31:33.128161Z',
 'services': {},
 'properties': {'num_iterations': '20',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'AUC_weighted',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '3',
  'target': 'COMPUTE-AUTOML',
  'DataPrepJsonString': '{\\"training_data\\": {\\"datasetId\\": \\"9be33acf-4421-4aa4-8b66-4f563a35eb4e\\"}, \\"datasets\\": 0}',
  'EnableSubsampling': 'False',
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'classification',
  'dependencies_versions': '{"azureml-widgets": "1.34.0", "azureml-train": "1.34.0", "azureml-train-restclients-hyperdrive": "1.34.0", "azureml-train-core": "1.34.0", "azureml-train-automl": "1.34.0", "azureml-train-automl-runtime": "1.34.0", "azureml-train-au

## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [14]:
def print_model(model, prefix=""):
    for step in model.steps:
        print(prefix + step[0])
        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):
            pprint({'estimators': list(e[0] for e in step[1].estimators), 'weights': step[1].weights})
            print()
            for estimator in step[1].estimators:
                print_model(estimator[1], estimator[0]+ ' - ')
        elif hasattr(step[1], '_base_learners') and hasattr(step[1], '_meta_learner'):
            print("\nMeta Learner")
            pprint(step[1]._meta_learner)
            print()
            for estimator in step[1]._base_learners:
                print_model(estimator[1], estimator[0]+ ' - ')
        else:
            pprint(step[1].get_params())
            print()

In [15]:
automl_best_run, automl_best_model = automl_run.get_output()

automl_best_run_metrics = automl_best_run.get_metrics()

print(f'********** Best AutoML accuracy: {automl_best_run_metrics.get("accuracy")}')
print(f'********** printing Best AutoML run:\n{automl_best_run}\n\nPrinting model:')

print_model(automl_best_model)

Package:azureml-automl-runtime, training version:1.35.1, current version:1.34.0
Package:azureml-core, training version:1.35.0.post1, current version:1.34.0
Package:azureml-dataprep, training version:2.23.2, current version:2.22.2
Package:azureml-dataprep-rslex, training version:1.21.2, current version:1.20.1
Package:azureml-dataset-runtime, training version:1.35.0, current version:1.34.0
Package:azureml-defaults, training version:1.35.0, current version:1.34.0
Package:azureml-interpret, training version:1.35.0, current version:1.34.0
Package:azureml-mlflow, training version:1.35.0, current version:1.34.0
Package:azureml-pipeline-core, training version:1.35.0, current version:1.34.0
Package:azureml-responsibleai, training version:1.35.0, current version:1.34.0
Package:azureml-telemetry, training version:1.35.0, current version:1.34.0
Package:azureml-train-automl-client, training version:1.35.0, current version:1.34.0
Package:azureml-train-automl-runtime, training version:1.35.1, current

********** Best AutoML accuracy: 0.9309730196255019
********** printing Best AutoML run:
Run(Experiment: experiment-capstone-automl,
Id: AutoML_0bf03af8-8d8d-4440-8bab-bd567c1496ea_18,
Type: azureml.scriptrun,
Status: Completed)

Printing model:
datatransformer
{'enable_dnn': False,
 'enable_feature_sweeping': True,
 'feature_sweeping_config': {},
 'feature_sweeping_timeout': 86400,
 'featurization_config': None,
 'force_text_dnn': False,
 'is_cross_validation': True,
 'is_onnx_compatible': False,
 'observer': None,
 'task': 'classification',
 'working_dir': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/notebook162691/code/Users/odl_user_162691'}

prefittedsoftvotingclassifier
{'estimators': ['10', '1', '8', '0', '3', '11', '6'],
 'weights': [0.2857142857142857,
             0.07142857142857142,
             0.07142857142857142,
             0.07142857142857142,
             0.07142857142857142,
             0.14285714285714285,
             0.2857142857142857]}

10 - standardscaler
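
The `prefittedsoftvotingclassifier` step above combines seven fitted pipelines using the listed weights. Below is a minimal sketch of soft voting, assuming it means a weighted average of each estimator's class probabilities. The per-estimator probabilities are invented for illustration, and `soft_vote` is a hypothetical helper, not the AutoML class.

```python
# Weights copied from the printed ensemble (estimators '10', '1', '8', '0', '3', '11', '6')
weights = [
    0.2857142857142857,
    0.07142857142857142,
    0.07142857142857142,
    0.07142857142857142,
    0.07142857142857142,
    0.14285714285714285,
    0.2857142857142857,
]


def soft_vote(probabilities, weights):
    """Hypothetical helper: weighted average of per-estimator class probabilities."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    # The predicted class is the one with the highest averaged probability
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return averaged, predicted


# Invented [P(class 0), P(class 1)] outputs of the seven estimators for one sample
made_up_probabilities = [
    [0.2, 0.8], [0.6, 0.4], [0.7, 0.3], [0.55, 0.45],
    [0.4, 0.6], [0.3, 0.7], [0.1, 0.9],
]

averaged, predicted = soft_vote(made_up_probabilities, weights)
print(averaged, predicted)
```

In this made-up case, three low-weight estimators lean towards class 0, but the heavily weighted ones lean towards class 1, so the ensemble predicts class 1.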

In [16]:
print(automl_run.get_metrics())

{'experiment_status': ['DatasetEvaluation', 'FeaturesGeneration', 'DatasetFeaturization', 'DatasetFeaturizationCompleted', 'DatasetCrossValidationSplit', 'ModelSelection', 'BestRunExplainModel', 'ModelExplanationDataSetSetup', 'PickSurrogateModel', 'EngineeredFeatureExplanations', 'EngineeredFeatureExplanations', 'RawFeaturesExplanations', 'RawFeaturesExplanations', 'BestRunExplainModel'], 'experiment_status_description': ['Gathering dataset statistics.', 'Generating features for the dataset.', 'Beginning to fit featurizers and featurize the dataset.', 'Completed fit featurizers and featurizing the dataset.', 'Generating individually featurized CV splits.', 'Beginning model selection.', 'Best run model explanations started', 'Model explanations data setup completed', 'Choosing LightGBM as the surrogate model for explanations', 'Computation of engineered features started', 'Computation of engineered features completed', 'Computation of raw features started', 'Computation of raw features

In [17]:
# Create the outputs directory
if 'outputs' not in os.listdir():
    os.mkdir('outputs')

In [18]:
#TODO: Save the best model
joblib.dump(automl_best_model, os.path.join('outputs','best_automl.pkl'))

['outputs/best_automl.pkl']

## Model Deployment

Remember you have to deploy only one of the two models you trained but you still need to register both the models. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [19]:
# Refer - https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-and-where?tabs=python

# Tutorial: Deploy an image classification model in Azure Container Instances -
# https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-deploy-models-with-aml

# Register the model
registered_model = automl_best_run.register_model(model_path=constants.MODEL_PATH, 
                                                model_name='breast-cancer-automl', 
                                                description='Breast Cancer detection using Azure AutoML',
                                                tags={'Method of execution':'AutoML'},
                                                properties={'Accuracy':automl_best_run_metrics['accuracy']})
print(f'{automl_run.model_id}')
print(f'{registered_model.name}  {registered_model.id}  {registered_model.version}')


None
breast-cancer-automl  breast-cancer-automl:1  1


In [None]:
# Download the scoring file and the environment file

automl_best_run.download_file(constants.SCORING_FILE_PATH, os.path.join('outputs', 'scoring.py'))
automl_best_run.download_file(constants.CONDA_ENV_FILE_PATH, os.path.join('outputs', 'best_run_environment.yml'))

In [None]:
# Create an inference config

inference_config = InferenceConfig(
    environment=Environment.from_conda_specification(name='myenv', file_path=os.path.join('outputs', 'best_run_environment.yml')),
    source_directory='outputs',
    entry_script='scoring.py',
)

aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)


In [None]:

service = Model.deploy(workspace=ws,
                       name='breast-cancer-service',
                       models=[registered_model],
                       inference_config=inference_config,
                       deployment_config=aci_config,
                       overwrite=True)
service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2021-10-31 19:38:14+00:00 Creating Container Registry if not exists..
2021-10-31 19:48:15+00:00 Registering the environment.
2021-10-31 19:48:16+00:00 Building image..
2021-10-31 19:59:21+00:00 Generating deployment configuration..
2021-10-31 19:59:23+00:00 Submitting deployment to compute..
2021-10-31 19:59:28+00:00 Checking the status of deployment breast-cancer-service..
2021-10-31 20:02:36+00:00 Checking the status of inference endpoint breast-cancer-service.
Succeeded
ACI service creation operation finished, operation "Succeeded"


In [None]:
logs = service.get_logs()

for line in logs.split('\n'):
    print(line)


2021-10-31T20:01:47,419255300+00:00 - iot-server/run 
2021-10-31T20:01:47,425648700+00:00 - gunicorn/run 
Dynamic Python package installation is disabled.
Starting HTTP server
2021-10-31T20:01:47,445270800+00:00 - rsyslog/run 
2021-10-31T20:01:47,448327200+00:00 - nginx/run 
rsyslogd: /azureml-envs/azureml_50e1173ccd55f57fca3555fc5c69dc71/lib/libuuid.so.1: no version information available (required by rsyslogd)
EdgeHubConnectionString and IOTEDGE_IOTHUBHOSTNAME are not set. Exiting...
2021-10-31T20:01:47,763428400+00:00 - iot-server/finish 1 0
2021-10-31T20:01:47,769369700+00:00 - Exit code 1 is normal. Not restarting iot-server.
Starting gunicorn 20.1.0
Listening at: http://127.0.0.1:31311 (71)
Using worker: sync
worker timeout is set to 300
Booting worker with pid: 98
SPARK_HOME not set. Skipping PySpark Initialization.
Generating new fontManager, this may take some time...
Initializing logger
2021-10-31 20:01:50,243 | root | INFO | Starting up app insights client
logging socket was 

TODO: In the cell below, send a request to the web service you deployed to test it.

In [None]:
# To enable ApplicationInsights on the service (webservice), 
# * first access the endpoint using the name assigned at the time of deployment
# * next update webservice parameters such as enabling application insights (enable_app_insights)

webservice = Webservice(
    workspace = ws,
    name='breast-cancer-service'
)

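webservice.update(
    enable_app_insights=True
)

# The update is itself a deployment operation; wait for it to finish
# before making further changes to the service
webservice.wait_for_deployment(show_output=True)

# At this point Application Insights (logging) is enabled and can be
# checked in the Azure ML studio GUI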

In [None]:
# URL for the web service, should be similar to:
# 'http://8530a665-66f3-49c8-a953-b82a2d312917.eastus.azurecontainer.io/score'

# From the tail end of the code at
# https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-and-where?tabs=python
# - Deploy machine learning models to Azure

scoring_uri = webservice.scoring_uri

# If the service is authenticated, set the key or token
# key, _ = webservice.get_keys()

# Set the appropriate headers
headers = {"Content-Type": "application/json"}
# headers["Authorization"] = f"Bearer {key}"


In [None]:
predict_data = predict_ds.to_pandas_dataframe()
predict_label = predict_data.pop('diagnosis')

In [None]:
# Convert to JSON string
score_data = json.dumps({'data': predict_data.to_dict(orient='records')})

score_data

'{"data": [{"mean_radius": 9.504, "mean_texture": 12.44, "mean_perimeter": 60.34, "mean_area": 273.9, "mean_smoothness": 0.1024}, {"mean_radius": 15.37, "mean_texture": 22.76, "mean_perimeter": 100.2, "mean_area": 728.2, "mean_smoothness": 0.092}, {"mean_radius": 21.09, "mean_texture": 26.57, "mean_perimeter": 142.7, "mean_area": 1311.0, "mean_smoothness": 0.1141}, {"mean_radius": 11.04, "mean_texture": 14.93, "mean_perimeter": 70.67, "mean_area": 372.7, "mean_smoothness": 0.07987}]}'
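
The payload shown above is a JSON object with a single `data` key holding one record per row. The sketch below builds the same structure from plain dictionaries, without pandas, using the first two records from the output. `build_payload` is a hypothetical helper name, and the feature check is an added safeguard rather than something the scoring service is known to require.

```python
import json

# Feature columns sent to the service (the label column 'diagnosis' is excluded)
FEATURES = ["mean_radius", "mean_texture", "mean_perimeter", "mean_area", "mean_smoothness"]


def build_payload(records):
    """Hypothetical helper: validate records and wrap them as {"data": [...]}."""
    for i, record in enumerate(records):
        missing = [name for name in FEATURES if name not in record]
        if missing:
            raise ValueError(f"record {i} is missing features: {missing}")
    return json.dumps({"data": records})


records = [
    {"mean_radius": 9.504, "mean_texture": 12.44, "mean_perimeter": 60.34,
     "mean_area": 273.9, "mean_smoothness": 0.1024},
    {"mean_radius": 15.37, "mean_texture": 22.76, "mean_perimeter": 100.2,
     "mean_area": 728.2, "mean_smoothness": 0.092},
]

payload = build_payload(records)
print(payload)
```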

In [None]:
# Set the content type
headers = {'Content-Type': 'application/json'}
# If authentication is enabled, set the authorization header
# headers['Authorization'] = f'Bearer {key}'

# Make the request and display the predictions
resp = requests.post(scoring_uri, score_data, headers=headers)
print(f'{resp.json()}')

# Print the actual diagnosis
print(f'{json.dumps({"labels": predict_label.tolist()})}')

{"result": [1, 0, 0, 1]}
{"labels": [1, 0, 0, 1]}
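
The printed response uses double quotes, which suggests that the scoring script returned a JSON string rather than a dictionary, so `resp.json()` may yield a `str`. That is an inference from the output above, not something the notebook confirms. The sketch below accepts either form and compares the predictions with the held-out labels; `parse_result` and `accuracy` are hypothetical helpers.

```python
import json


def parse_result(body):
    """Return the prediction list from a response body that is a dict or a JSON string."""
    if isinstance(body, str):
        body = json.loads(body)  # the body was JSON-encoded a second time
    return body["result"]


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels differ in length")
    if not labels:
        raise ValueError("no labels to compare against")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Values copied from the output above
predictions = parse_result('{"result": [1, 0, 0, 1]}')
labels = [1, 0, 0, 1]
print(accuracy(predictions, labels))  # prints 1.0
```

Four held-out rows are far too few to estimate model quality; the check only confirms that the endpoint responds sensibly.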


TODO: In the cell below, print the logs of the web service and delete the service

In [None]:
logs = webservice.get_logs()

for line in logs.split('\n'):
    print(line)



2021-10-31T20:01:47,419255300+00:00 - iot-server/run 
2021-10-31T20:01:47,425648700+00:00 - gunicorn/run 
Dynamic Python package installation is disabled.
Starting HTTP server
2021-10-31T20:01:47,445270800+00:00 - rsyslog/run 
2021-10-31T20:01:47,448327200+00:00 - nginx/run 
rsyslogd: /azureml-envs/azureml_50e1173ccd55f57fca3555fc5c69dc71/lib/libuuid.so.1: no version information available (required by rsyslogd)
EdgeHubConnectionString and IOTEDGE_IOTHUBHOSTNAME are not set. Exiting...
2021-10-31T20:01:47,763428400+00:00 - iot-server/finish 1 0
2021-10-31T20:01:47,769369700+00:00 - Exit code 1 is normal. Not restarting iot-server.
Starting gunicorn 20.1.0
Listening at: http://127.0.0.1:31311 (71)
Using worker: sync
worker timeout is set to 300
Booting worker with pid: 98
SPARK_HOME not set. Skipping PySpark Initialization.
Generating new fontManager, this may take some time...
Initializing logger
2021-10-31 20:01:50,243 | root | INFO | Starting up app insights client
logging socket was 

In [None]:
# Clean up any resources
# Delete the Webservice
# delete the compute cluster

webservice.delete()
cc.delete()

WebserviceException: WebserviceException:
	Message: There is a deployment operation in flight for the Service: breast-cancer-service
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "There is a deployment operation in flight for the Service: breast-cancer-service"
    }
}
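
The error above indicates that an operation on the service was still in flight when `delete()` was called, possibly the earlier `update()` call. One option is to retry the cleanup after a short wait. Below is a generic sketch, assuming the cleanup action is passed in as a callable. `retry` is a hypothetical helper, not part of the Azure ML SDK, and `flaky_delete` merely simulates a service that is busy for the first two attempts.

```python
import time


def retry(action, attempts=5, delay_seconds=30.0, sleep=time.sleep):
    """Call `action` until it succeeds or `attempts` is exhausted, pausing between tries."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # in the notebook, narrow this to the expected exception type
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)
    raise last_error


# Simulated cleanup that fails twice before succeeding
calls = {"count": 0}


def flaky_delete():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("There is a deployment operation in flight")
    return "deleted"


result = retry(flaky_delete, attempts=5, delay_seconds=0.0, sleep=lambda seconds: None)
print(result, calls["count"])  # prints: deleted 3
```

In the notebook this could be used as `retry(webservice.delete)` followed by `cc.delete()`, so that the compute cluster is still removed once the service deletion goes through.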

**Submission Checklist**
- I have registered the model.
- I have deployed the model with the best accuracy as a webservice.
- I have tested the webservice by sending a request to the model endpoint.
- I have deleted the webservice and shutdown all the computes that I have used.
- I have taken a screenshot showing the model endpoint as active.
- The project includes a file containing the environment details.
