# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

### Import All Required Dependencies

In [8]:
# System libraries
import os
import csv
import shutil
import logging

# Conda libraries
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import pkg_resources

# Azure core libraries
import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset

# Compute target core libraries
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException


# Libraries for Visualizing run
from azureml.widgets import RunDetails

# Library for saving the model
import joblib

# Clean data function defined in train.py script
from train import clean_data 

# ONNX libraries
import sys
import json
from azureml.automl.core.onnx_convert import OnnxConvertConstants
from azureml.train.automl import constants

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.20.0


### Initialize Workspace
Initialize a workspace object from persisted configuration. Make sure the config file is present at .\config.json

In [2]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

quick-starts-ws-137067
aml-quickstarts-137067
southcentralus
aa7cf8e8-d23f-4bce-a7b9-1f0b4e0ac8ee
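
`Workspace.from_config()` reads the workspace coordinates from a JSON file. The sketch below is a plain-Python pre-flight check, assuming the file holds the three keys usually found in the config downloaded from the Azure portal (`subscription_id`, `resource_group`, `workspace_name`). The helper name `missing_config_keys` and the placeholder values are hypothetical and not part of the SDK.

```python
import json
import os
import tempfile

# Keys expected in the workspace config file (assumption: standard portal download).
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(path):
    """Return the required keys that are absent or empty in the JSON file at `path`."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Demo against a throwaway file with placeholder values (not real credentials).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "config.json")
    with open(demo_path, "w", encoding="utf-8") as handle:
        json.dump({"subscription_id": "<subscription-id>",
                   "resource_group": "<resource-group>"}, handle)
    demo_missing = missing_config_keys(demo_path)

print(demo_missing)
```

Running a check like this before `Workspace.from_config()` can make a missing key easier to spot than the SDK's own error message.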


### Setup Experiment 
Here we create an experiment named "liver-disease-automl".

In [3]:
# Choose a name for the experiment
experiment_name = 'liver-disease-automl'

experiment = Experiment(ws, experiment_name)

### Create or Attach an AmlCompute cluster
An AutoML run needs a compute target. The cell below reuses the AmlCompute cluster if it already exists and otherwise provisions one (`STANDARD_DS3_V2`, up to 4 nodes) as the training compute resource.

In [4]:
#Create compute cluster

# Choose a name for your CPU cluster
cpu_cluster_name = "notebook137067"
# Reuse the cluster if it already exists; otherwise create it
try:
    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS3_V2',
                                                           max_nodes=4)
    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

Found existing cluster, use it.

Running


## Dataset

TODO: In this markdown cell, give an overview of the dataset you are using. Also mention the task you will be performing.
TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

### Overview
The number of patients with liver disease has been rising in recent years because of lifestyle and living habits such as excessive alcohol consumption, inhalation of harmful gases, excessive weight gain, intake of contaminated food, and drug abuse. This dataset is intended to help doctors during the clinical diagnosis of liver disease by alleviating the burden and stress of analyzing every single patient's information. The goal is therefore to build a classifier that predicts whether a subject is healthy (non-liver patient) or ill (liver patient) from the following clinical and demographic features: age, gender, total bilirubin, direct bilirubin, total proteins, albumin, A/G ratio, SGPT, SGOT, and Alkphos.

In [5]:
# Load the liver dataset from GitHub into a pandas DataFrame
data_path = "https://raw.githubusercontent.com/chollette/Liver-Disease-Classification-Azure-ML-Capstone-Project/master/starter_file/data/Liver%20Patient%20Dataset%20(LPD)_train.csv"
dataset = pd.read_csv(data_path)

In [6]:
# Use the clean_data function to clean your data.
x, y = clean_data(dataset)
train_data = pd.concat([x, y], axis=1, sort=False)
# Upload the cleaned liver data to the default datastore (blob) of the workspace.

# First write the data to a .csv file
train_data.to_csv('train_data.csv',header=True)

# Then upload it to the datastore
datastore = ws.get_default_datastore()
datastore.upload_files(['train_data.csv'], target_path='', overwrite=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


Uploading an estimated of 1 files
Uploading train_data.csv
Uploaded train_data.csv, 1 files out of an estimated total of 1
Uploaded 1 files


$AZUREML_DATAREFERENCE_workspaceblobstore

In [9]:
# Load the uploaded file as a TabularDataset so AutoML can consume it
train_data = Dataset.Tabular.from_delimited_files(path = [(datastore, 'train_data.csv')])
label = "Result"

## AutoML Configuration

TODO: Explain why you chose the automl settings and configuration you used below.

In [10]:
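# TODO: Put your automl settings here
automl_settings = {
    "experiment_timeout_hours": 1,        # stop the whole experiment after one hour
    #"experiment_timeout_minutes": 30,
    "enable_early_stopping": True,        # end early if the score stops improving
    "model_explainability": True,         # compute explanations for the best model
    "iteration_timeout_minutes": 5,       # cap each training iteration at five minutes
    "max_concurrent_iterations": 5,       # ideally no more than the cluster's max_nodes (4 above)
    "max_cores_per_iteration": -1,        # use all available cores on the node
    "n_cross_validations": 10,            # 10-fold cross-validation
    "primary_metric": 'accuracy',         # metric used to rank models
    "featurization": 'auto',              # let AutoML handle preprocessing
    "verbosity": logging.INFO,
}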

# TODO: Put your automl config here
automl_config = AutoMLConfig(task='classification',
                             compute_target=compute_target,
                             training_data=train_data,
                             label_column_name=label,
                             debug_log="automl_errors.log",
                             **automl_settings
                            )
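
The timeout and concurrency settings above together bound how many iterations the experiment can try. The sketch below is rough arithmetic only: it assumes one iteration runs per cluster node at a time and ignores setup, featurization, and ensembling overhead. The function name `worst_case_iterations` is hypothetical.

```python
def worst_case_iterations(timeout_hours, iteration_timeout_minutes,
                          max_concurrent_iterations, max_nodes):
    """Iterations that finish if every iteration runs until its own timeout."""
    # Assumption: one iteration per node at a time, so the node count caps concurrency.
    concurrency = min(max_concurrent_iterations, max_nodes)
    # Number of back-to-back "waves" of iterations that fit in the experiment timeout.
    waves = int(timeout_hours * 60) // iteration_timeout_minutes
    return waves * concurrency


# Values taken from the settings and the cluster configuration in this notebook.
budget = worst_case_iterations(timeout_hours=1, iteration_timeout_minutes=5,
                               max_concurrent_iterations=5, max_nodes=4)
print(budget)
```

Most iterations on a small tabular dataset finish well before their timeout, so the real count can be much higher than this pessimistic figure.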

In [11]:
# TODO: Submit your experiment
remote_run = experiment.submit(automl_config)

Running on remote.


## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [12]:
#Visualize experiment
RunDetails(remote_run).show()


_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

In [13]:
remote_run.wait_for_completion()

{'runId': 'AutoML_504cb5fa-cd54-4ce5-a418-95cf7d6c6e96',
 'target': 'notebook137067',
 'status': 'Completed',
 'startTimeUtc': '2021-02-03T09:57:01.324013Z',
 'endTimeUtc': '2021-02-03T10:19:48.633216Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '10',
  'target': 'notebook137067',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"513109ba-aed7-40a6-9f69-b3a6dac1b99b\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"train_data.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"aml-quickstarts-137067\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"aa7cf8e8-d23f-4bce-a7b9-1f0b4e0ac8ee\\\\\\", \\\\\\"wo
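
The `startTimeUtc` and `endTimeUtc` fields in the details above give the wall-clock duration of the run. As a minimal sketch, assuming the timestamps always use the format shown in that output, the duration can be computed with the standard library. The helper name `run_duration_seconds` is hypothetical.

```python
from datetime import datetime

# Timestamp layout as it appears in the run details printed above.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def run_duration_seconds(details):
    """Return the elapsed seconds between the run's start and end timestamps."""
    start = datetime.strptime(details["startTimeUtc"], TIMESTAMP_FORMAT)
    end = datetime.strptime(details["endTimeUtc"], TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


# Values copied from the output of remote_run.wait_for_completion() above.
details = {
    "startTimeUtc": "2021-02-03T09:57:01.324013Z",
    "endTimeUtc": "2021-02-03T10:19:48.633216Z",
}
duration = run_duration_seconds(details)
print(f"{duration / 60:.1f} minutes")
```

For this run the result is roughly 23 minutes, well inside the one-hour experiment timeout.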

## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [14]:
# Retrieve the metrics recorded on the AutoML run.
best_automl_run_metrics = remote_run.get_metrics()
print(best_automl_run_metrics)

{'experiment_status': ['DatasetEvaluation', 'FeaturesGeneration', 'DatasetFeaturization', 'DatasetFeaturizationCompleted', 'DatasetCrossValidationSplit', 'ModelSelection'], 'experiment_status_description': ['Gathering dataset statistics.', 'Generating features for the dataset.', 'Beginning to fit featurizers and featurize the dataset.', 'Completed fit featurizers and featurizing the dataset.', 'Generating individually featurized CV splits.', 'Beginning model selection.'], 'average_precision_score_micro': 0.9997427104307498, 'log_loss': 0.2206832707310776, 'balanced_accuracy': 0.9996109644048939, 'f1_score_micro': 0.9997790597699465, 'f1_score_macro': 0.9997281910606806, 'precision_score_macro': 0.9998458051153032, 'f1_score_weighted': 0.9997790024221924, 'weighted_accuracy': 0.9998942174962437, 'recall_score_weighted': 0.9997790597699465, 'AUC_macro': 0.9997405840707959, 'matthews_correlation': 0.999456708922604, 'precision_score_micro': 0.9997790597699465, 'AUC_weighted': 0.9997405840

In [15]:
print("Best AutoML model Accuracy: ", best_automl_run_metrics['accuracy'])

Best AutoML model Accuracy:  0.9997790597699465
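
The metrics dictionary mixes list-valued status fields with numeric scores, which makes the raw printout hard to read. The sketch below keeps only the numeric entries and prints them in sorted order. The sample dictionary is a small subset copied from the output above, and the helper name `scalar_metrics` is hypothetical.

```python
from numbers import Real


def scalar_metrics(metrics):
    """Keep only numeric entries, dropping list-valued status fields."""
    return {name: value for name, value in metrics.items()
            if isinstance(value, Real) and not isinstance(value, bool)}


# Subset of the metrics printed above (status list shortened for brevity).
sample = {
    "experiment_status": ["DatasetEvaluation", "ModelSelection"],
    "log_loss": 0.2206832707310776,
    "balanced_accuracy": 0.9996109644048939,
    "f1_score_macro": 0.9997281910606806,
    "matthews_correlation": 0.999456708922604,
}
summary = scalar_metrics(sample)
for name in sorted(summary):
    print(f"{name:<24}{summary[name]:.4f}")
```

In the notebook, the same helper could be applied to `best_automl_run_metrics` directly.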


In [16]:
# Retrieve the best run and its fitted model
best_run, fitted_model = remote_run.get_output()
print(best_run)
print(fitted_model.steps)

Package:azureml-automl-runtime, training version:1.21.0, current version:1.20.0
Package:azureml-core, training version:1.21.0.post1, current version:1.20.0
Package:azureml-dataprep, training version:2.8.2, current version:2.7.3
Package:azureml-dataprep-native, training version:28.0.0, current version:27.0.0
Package:azureml-dataprep-rslex, training version:1.6.0, current version:1.5.0
Package:azureml-dataset-runtime, training version:1.21.0, current version:1.20.0
Package:azureml-defaults, training version:1.21.0, current version:1.20.0
Package:azureml-interpret, training version:1.21.0, current version:1.20.0
Package:azureml-pipeline-core, training version:1.21.0, current version:1.20.0
Package:azureml-telemetry, training version:1.21.0, current version:1.20.0
Package:azureml-train-automl-client, training version:1.21.0, current version:1.20.0
Package:azureml-train-automl-runtime, training version:1.21.0, current version:1.20.0


Run(Experiment: liver-disease-automl,
Id: AutoML_504cb5fa-cd54-4ce5-a418-95cf7d6c6e96_38,
Type: azureml.scriptrun,
Status: Completed)
[('datatransformer', DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                feature_sweeping_config=None, feature_sweeping_timeout=None,
                featurization_config=None, force_text_dnn=None,
                is_cross_validation=None, is_onnx_compatible=None, logger=None,
                observer=None, task=None, working_dir=None)), ('prefittedsoftvotingclassifier', PreFittedSoftVotingClassifier(classification_labels=None,
                              estimators=[('0',
                                           Pipeline(memory=None,
                                                    steps=[('maxabsscaler',
                                                            MaxAbsScaler(copy=True)),
                                                           ('lightgbmclassifier',
                                                  

In [17]:
#TODO: Save the best model
joblib.dump(fitted_model, 'automl-votingEnsemble_model.joblib')

['automl-votingEnsemble_model.joblib']
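
The fitted pipeline ends in a `PreFittedSoftVotingClassifier`, i.e. a soft-voting ensemble. The sketch below illustrates soft voting in general terms: each member model outputs class probabilities, the probabilities are combined as a weighted average, and the class with the highest average wins. It is not the AutoML implementation, and the probabilities and weights are hypothetical.

```python
def soft_vote(probabilities, weights):
    """Return (winning class index, weighted-average probabilities) for one sample."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    # Weighted average of each class probability across the member models.
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=averaged.__getitem__)
    return winner, averaged


# Hypothetical per-model probabilities for [class 0, class 1] and hypothetical weights.
member_probabilities = [[0.90, 0.10], [0.40, 0.60], [0.30, 0.70]]
member_weights = [0.5, 0.25, 0.25]
label, averaged = soft_vote(member_probabilities, member_weights)
print(label, averaged)
```

In this example two of the three members favor class 1, yet class 0 wins because the most confident member also carries the largest weight. That is the difference between soft voting and a simple majority vote.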

### Register the Fitted Model for Deployment

In [19]:
# Register the model
model = best_run.register_model(model_name='model',
                                model_path='outputs/model.pkl',
                                tags=best_run.get_metrics())
print(model.name, model.id, model.version, sep='\t')

model	model:1	1


In [21]:
# Clean up: delete the compute cluster once the run is finished
compute_target.delete()