# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [1]:
#!pip install --upgrade azureml-automl-core

In [2]:
#!pip install --upgrade azureml-sdk[automl]

In [3]:
# Check core SDK version number
import azureml.core  # binds the name `azureml` so the attribute lookup below works
print("SDK version:", azureml.core.VERSION)

SDK version: 1.20.0


Import Libraries

In [4]:
import logging
import os
import csv
import pandas as pd
import numpy as np
import json
import requests
import joblib
from sklearn.metrics import confusion_matrix
import itertools

from azureml.core import Dataset, Workspace, Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.widgets import RunDetails
from azureml.train.automl import AutoMLConfig

from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice, Webservice
from azureml.core.model import Model
from azureml.core.environment import Environment

## Dataset

### Overview
TODO: In this markdown cell, give an overview of the dataset you are using. Also mention the task you will be performing.


TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

The dataset was obtained from Kaggle and originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases. It contains medical records of patients and is used to predict whether a patient has diabetes. Diabetes is a metabolic disease that causes high blood sugar over a prolonged period of time. Common symptoms include frequent urination, increased thirst, and increased appetite (Wikipedia).

The dataset has the following columns. The first eight are the medical features used to predict whether a patient is diabetic; the ninth is the target:
1.	Pregnancies - number of times pregnant
2.	Glucose - plasma glucose concentration at 2 hours in an oral glucose tolerance test
3.	Blood pressure - diastolic blood pressure (mm Hg)
4.	Skin thickness - triceps skin fold thickness (mm)
5.	Insulin - 2-hour serum insulin (mu U/ml)
6.	BMI - body mass index (weight in kg / (height in m)^2)
7.	Diabetes pedigree - diabetes pedigree function
8.	Age - age of patient (years)
9.	Outcome - class variable (0 or 1); 268 of the 768 records are 1 and the rest are 0

In this project, Azure AutoML was used to find the best model for predicting the outcome, that is, whether a patient has diabetes, based on the 8 medical features measured for each patient.
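
The class counts quoted in the column list give a useful reference point for any accuracy figure reported later. The sketch below uses only those counts (268 positives out of 768 records) to compute the positive rate and the accuracy of a trivial classifier that always predicts the majority class. The variable names are illustrative.

```python
# Class balance from the counts stated in the dataset description.
total_records = 768
positive_records = 268  # Outcome == 1
negative_records = total_records - positive_records  # Outcome == 0

positive_rate = positive_records / total_records

# A model that always predicts the majority class is right for every
# majority record and wrong for every minority record.
majority_baseline_accuracy = max(positive_records, negative_records) / total_records

print(f"Positive rate: {positive_rate:.3f}")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.3f}")
```

Any model worth deploying should score clearly above this baseline on the chosen metric.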

## Workspace setup

In [5]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'automl-diabetes-experiment'

experiment=Experiment(ws, experiment_name)
experiment

Name,Workspace,Report Page,Docs Page
automl-diabetes-experiment,quick-starts-ws-136277,Link to Azure Machine Learning studio,Link to Documentation


In [6]:
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

Workspace name: quick-starts-ws-136277
Azure region: southcentralus
Subscription id: b968fb36-f06a-4c76-a15f-afab68ae7667
Resource group: aml-quickstarts-136277
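
`Workspace.from_config()` reads the workspace details from a `config.json` file that holds the subscription ID, resource group, and workspace name. The sketch below only illustrates the shape of that file using the standard library. The subscription ID is a placeholder, and the validation code is an illustration, not what the SDK does internally.

```python
import json

# Shape of the config.json that Workspace.from_config() looks for.
# The subscription ID below is a placeholder, not a real value.
config_text = """
{
    "subscription_id": "00000000-0000-0000-0000-000000000000",
    "resource_group": "aml-quickstarts-136277",
    "workspace_name": "quick-starts-ws-136277"
}
"""

config = json.loads(config_text)

# Fail early with a readable message if a key is absent.
required_keys = ("subscription_id", "resource_group", "workspace_name")
missing = [key for key in required_keys if key not in config]
if missing:
    raise KeyError(f"config.json is missing: {missing}")

print("Workspace:", config["workspace_name"])
```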


## Config Compute Cluster

In [7]:
# create or attach an existing compute cluster
cpu_cluster_name = "cpu-cluster"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D12_V2',
                                                           max_nodes=5)
    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Data

In [8]:
dataset = Dataset.get_by_name(ws, 'Diabetes-dataset')

In [9]:
df = dataset.to_pandas_dataframe()
df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


### Review the Dataset Result

You can peek at the result of a TabularDataset at any range using `skip(i)` and `take(j).to_pandas_dataframe()`. Doing so evaluates only j records for all the steps in the TabularDataset, which makes it fast even against large datasets.

TabularDataset objects are composed of a list of transformation steps (optional).

In [10]:
dataset.take(5).to_pandas_dataframe()

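| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |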
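
The summary statistics show a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI, and the preview rows contain several such zeros. A zero is not a plausible measurement for these quantities, so it may stand in for a missing value. That reading is an assumption, since the source does not say so. The sketch below counts zeros per column in the five preview rows using plain Python.

```python
# The five preview rows from the notebook output, limited to the columns
# where a value of 0 is suspicious.
preview_rows = [
    {"Glucose": 148, "BloodPressure": 72, "SkinThickness": 35, "Insulin": 0, "BMI": 33.6},
    {"Glucose": 85, "BloodPressure": 66, "SkinThickness": 29, "Insulin": 0, "BMI": 26.6},
    {"Glucose": 183, "BloodPressure": 64, "SkinThickness": 0, "Insulin": 0, "BMI": 23.3},
    {"Glucose": 89, "BloodPressure": 66, "SkinThickness": 23, "Insulin": 94, "BMI": 28.1},
    {"Glucose": 137, "BloodPressure": 40, "SkinThickness": 35, "Insulin": 168, "BMI": 43.1},
]

suspect_columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count how many rows hold a zero in each suspect column.
zero_counts = {
    column: sum(1 for row in preview_rows if row[column] == 0)
    for column in suspect_columns
}
print(zero_counts)
```

On the full DataFrame, the same check could be written as `(df[suspect_columns] == 0).sum()`.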


In [11]:
train_data, test_data = dataset.random_split(0.9)
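
`random_split(0.9)` sends roughly 90% of the records to `train_data` and the rest to `test_data`. The method also accepts a `seed` argument, which makes the split repeatable between runs. The sketch below illustrates the idea of a seeded 90/10 split on row indices with the standard library. It is not the SDK's algorithm, and the seed value is arbitrary.

```python
import random


def seeded_split(n_rows, train_fraction, seed):
    """Return (train_indices, test_indices) for a repeatable random split."""
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]


train_idx, test_idx = seeded_split(n_rows=768, train_fraction=0.9, seed=42)
print(len(train_idx), len(test_idx))
```

Calling `seeded_split` again with the same seed returns the same two index lists, which is what makes a later evaluation comparable across notebook runs.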

## AutoML Configuration

TODO: Explain why you chose the automl settings and configuration you used below.

In [12]:
# TODO: Put your automl settings here
automl_settings = {
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 4,
    "primary_metric" : 'accuracy',
    "n_cross_validations": 5
}

# TODO: Put your automl config here
automl_config = AutoMLConfig(compute_target=compute_target,
                             task = "classification",
                             training_data=train_data,
                             label_column_name="Outcome", 
                             enable_early_stopping= True,
                             featurization= 'auto',
                             **automl_settings
                            )
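
With `n_cross_validations` set to 5, each candidate model is scored by 5-fold cross-validation on the training data. The sketch below shows how k-fold index partitioning works in general, using only the standard library. It is an illustration rather than AutoML's internal implementation, and the row count of 691 is an assumed size for a 90% split of 768 records.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_rows))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_rows=691, k=5))
for fold_number, (train, validation) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: train={len(train)}, validation={len(validation)}")
```

Every row is used for validation exactly once, so the averaged metric reflects the whole training set rather than a single hold-out slice.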

In [13]:
# TODO: Submit your experiment
remote_run = experiment.submit(automl_config, show_output = True)

Running on remote.
No run_configuration provided, running on cpu-cluster with default configuration
Running on remote compute: cpu-cluster
Parent Run ID: AutoML_6d887152-79c9-442a-a385-9472117cfed6

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

****************************************************************************************************

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values we

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [14]:
remote_run

Experiment,Id,Type,Status,Details Page,Docs Page
automl-diabetes-experiment,AutoML_6d887152-79c9-442a-a385-9472117cfed6,automl,Completed,Link to Azure Machine Learning studio,Link to Documentation


In [15]:
RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

In [16]:
remote_run.wait_for_completion()

{'runId': 'AutoML_6d887152-79c9-442a-a385-9472117cfed6',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-01-27T15:44:39.943763Z',
 'endTimeUtc': '2021-01-27T16:06:33.707523Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '5',
  'target': 'cpu-cluster',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"6d336b5b-ac50-4ce4-88a5-24b56b6463f1\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"UI/01-27-2021_034209_UTC/diabetes.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"aml-quickstarts-136277\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"b968fb36-f06a-4c76-a15f-afab68ae7667\\

## Create an environment

In [17]:
%%writefile conda_dependencies.yml

dependencies:
- python=3.6.2
- pip=20.2.4
- pip:
  - azureml-core==1.20.0
  - azureml-pipeline-core==1.20.0
  - azureml-telemetry==1.20.0
  - azureml-defaults==1.20.0
  - azureml-interpret==1.20.0
  - azureml-automl-core==1.20.0
  - azureml-automl-runtime==1.20.0
  - azureml-train-automl-client==1.20.0
  - azureml-train-automl-runtime==1.20.0.post1
  - azureml-dataset-runtime==1.20.0
  - inference-schema
  - py-cpuinfo==5.0.0
  - boto3==1.15.18
  - botocore==1.18.18
- numpy~=1.18.0
- scikit-learn==0.22.1
- pandas~=0.25.0
- fbprophet==0.5
- holidays==0.9.11
- setuptools-git
- psutil>5.0.0,<6.0.0

Writing conda_dependencies.yml


In [18]:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

env = Environment.get(workspace=ws, name="AzureML-AutoML")

## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [19]:
best_run, fitted_model = remote_run.get_output()
best_run_metrics = best_run.get_metrics()


In [20]:
best_run

Experiment,Id,Type,Status,Details Page,Docs Page
automl-diabetes-experiment,AutoML_6d887152-79c9-442a-a385-9472117cfed6_36,azureml.scriptrun,Completed,Link to Azure Machine Learning studio,Link to Documentation


In [21]:
fitted_model

Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                               tree_method='auto',
                                                                                               verbose=-10,
                                                                                               verbosity=0))],
                                                          

In [22]:
print('Best Run Id: ', best_run.id)
print('\n Accuracy:', best_run_metrics['accuracy'])
print(fitted_model._final_estimator)
print(best_run.get_tags())

Best Run Id:  AutoML_6d887152-79c9-442a-a385-9472117cfed6_36

 Accuracy: 0.7859701492537313
PreFittedSoftVotingClassifier(classification_labels=None,
                              estimators=[('1',
                                           Pipeline(memory=None,
                                                    steps=[('maxabsscaler',
                                                            MaxAbsScaler(copy=True)),
                                                           ('xgboostclassifier',
                                                            XGBoostClassifier(base_score=0.5,
                                                                              booster='gbtree',
                                                                              colsample_bylevel=1,
                                                                              colsample_bynode=1,
                                                                              colsample_bytree=1,
         

In [23]:
#TODO: Save the best model
os.makedirs('./outputs', exist_ok=True)

joblib.dump(fitted_model, filename='outputs/automl.joblib')

model_name = best_run.properties['model_name']
model_name

'AutoML6d887152736'
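
The best pipeline printed earlier ends in a `PreFittedSoftVotingClassifier`. In soft voting, each member model outputs class probabilities, and the ensemble predicts the class with the highest (optionally weighted) average probability. The sketch below shows that rule in plain Python. The probabilities and weights are made-up numbers for illustration and are not taken from this run.

```python
def soft_vote(member_probabilities, weights):
    """Average per-class probabilities across models and pick the best class."""
    total_weight = sum(weights)
    n_classes = len(member_probabilities[0])
    averaged = [
        sum(w * probs[c] for w, probs in zip(weights, member_probabilities)) / total_weight
        for c in range(n_classes)
    ]
    predicted_class = max(range(n_classes), key=lambda c: averaged[c])
    return predicted_class, averaged


# Hypothetical outputs of three member models for one patient:
# each inner list is [P(Outcome=0), P(Outcome=1)].
member_probabilities = [
    [0.60, 0.40],
    [0.30, 0.70],
    [0.45, 0.55],
]
weights = [0.5, 0.3, 0.2]  # hypothetical ensemble weights

predicted_class, averaged = soft_vote(member_probabilities, weights)
print(predicted_class, [round(p, 2) for p in averaged])
```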

In [24]:
env = best_run.get_environment()

script_file = 'score.py'

best_run.download_file('outputs/scoring_file_v_1_0_0.py', script_file)
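
An Azure ML entry script exposes two functions: `init()`, which runs once when the service starts and loads the model, and `run()`, which handles each request. The sketch below mimics that structure with a stand-in model so it can run without the SDK. `StubModel` and its glucose threshold are hypothetical, and the downloaded `score.py` will differ in its details.

```python
import json

model = None  # populated once by init()


class StubModel:
    """Hypothetical stand-in for the real fitted pipeline."""

    def predict(self, rows):
        # Arbitrary illustrative rule, not the trained model's logic.
        return [int(row["Glucose"] >= 140) for row in rows]


def init():
    """Load the model once at service start-up."""
    global model
    model = StubModel()


def run(raw_data):
    """Score one request; raw_data is a JSON string with a "data" list."""
    try:
        rows = json.loads(raw_data)["data"]
        return json.dumps({"result": model.predict(rows)})
    except Exception as exc:
        # Return the error in the response instead of crashing the service.
        return json.dumps({"error": str(exc)})


init()
print(run(json.dumps({"data": [{"Glucose": 150}, {"Glucose": 90}]})))
```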

## Model Deployment

Remember that you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [25]:
#Register the fitted model
model = remote_run.register_model(model_name = model_name,
                                  description = 'AutoML model')

TODO: In the cell below, send a request to the web service you deployed to test it.

In [26]:
inference_config = InferenceConfig(entry_script = script_file, environment = env)

aci_config = AciWebservice.deploy_configuration(cpu_cores = 1, memory_gb = 1)

aci_service_name = 'automl-diabetes'
print(aci_service_name)

automl-diabetes


In [27]:
service = Model.deploy(ws, aci_service_name, [model], inference_config, aci_config)
service.wait_for_deployment(True)
print("State: " + service.state)
print("Scoring URI: " + service.scoring_uri)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running..............................................................
Succeeded
ACI service creation operation finished, operation "Succeeded"
State: Healthy
Scoring URI: http://41cbb5ad-1b92-4bf3-aead-e134d7a13903.southcentralus.azurecontainer.io/score


## Testing Runs

In [29]:
%run endpoint.py

{"result": [1, 0]}
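
The contents of `endpoint.py` are not shown in this notebook. A script of that kind typically posts a JSON payload to the scoring URI, and the sketch below builds such a request with the standard library without sending it. The URI is a placeholder, and the two records are taken from the dataset preview for illustration; they are not necessarily what `endpoint.py` sends.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, records):
    """Build (but do not send) a POST request carrying records as JSON."""
    body = json.dumps({"data": records}).encode("utf-8")
    return urllib.request.Request(
        scoring_uri,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


records = [
    {"Pregnancies": 6, "Glucose": 148, "BloodPressure": 72, "SkinThickness": 35,
     "Insulin": 0, "BMI": 33.6, "DiabetesPedigreeFunction": 0.627, "Age": 50},
    {"Pregnancies": 1, "Glucose": 85, "BloodPressure": 66, "SkinThickness": 29,
     "Insulin": 0, "BMI": 26.6, "DiabetesPedigreeFunction": 0.351, "Age": 31},
]

# Placeholder URI; replace with service.scoring_uri for a live service.
request = build_scoring_request("http://example.invalid/score", records)
print(request.get_method(), request.full_url)

# To send it against a live service:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```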


In [30]:
test_data = test_data.to_pandas_dataframe().dropna()
data_sample = test_data.sample(2)
y_true = data_sample.pop('Outcome')
sample_json = json.dumps({'data':data_sample.to_dict(orient='records')})
print(sample_json)

{"data": [{"Pregnancies": 1, "Glucose": 91, "BloodPressure": 64, "SkinThickness": 24, "Insulin": 0, "BMI": 29.2, "DiabetesPedigreeFunction": 0.192, "Age": 21}, {"Pregnancies": 8, "Glucose": 110, "BloodPressure": 76, "SkinThickness": 0, "Insulin": 0, "BMI": 27.8, "DiabetesPedigreeFunction": 0.237, "Age": 58}]}


In [31]:
output = service.run(sample_json)
print('Prediction: ', output)
print('True Values: ', y_true.values)

Prediction:  {"result": [0, 0]}
True Values:  [0 0]
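
Two records are too few to judge the deployed model, but the same comparison scales to the whole held-out set. The sketch below computes accuracy and the four confusion-matrix counts for binary labels in plain Python (scikit-learn's `confusion_matrix`, imported at the top of this notebook, could be used instead). The label lists here are made up for illustration.

```python
def binary_confusion_counts(y_true, y_pred):
    """Count true/false negatives and positives for 0/1 labels."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["tp"] += 1
        elif actual == 0 and predicted == 0:
            counts["tn"] += 1
        elif actual == 0 and predicted == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


# Hypothetical labels, not results from this run.
y_true_example = [0, 0, 1, 1, 0, 1]
y_pred_example = [0, 1, 1, 0, 0, 1]

counts = binary_confusion_counts(y_true_example, y_pred_example)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true_example)
print(counts, round(accuracy, 3))
```

For a medical screening task, the false-negative count is often the number to watch, since it represents diabetic patients the model would miss.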


TODO: In the cell below, print the logs of the web service and delete the service

In [32]:
# Print the logs so newlines render instead of appearing as escaped text
print(service.get_logs())



In [33]:
service.delete()