# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

## Import dependencies

### Import libraries

In [1]:
# Azure Environment libraries
from azureml.core import Environment

# Azure dataset libraries
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.core import Dataset, Datastore

# Azure workspace and experiment Libraries
from azureml.core import Workspace, Experiment

# Azure compute cluster libraries
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

# Azure train and run libraries
from azureml.core.run import Run
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
from azureml.widgets import RunDetails

# Azure deployment libraries
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice, Webservice


# ONNX libraries
from azureml.automl.runtime.onnx_convert import OnnxConverter

# OS libraries
import os
import shutil
import requests
import json

## Dataset

### Overview

TODO: In this markdown cell, give an overview of the dataset you are using. Also mention the task you will be performing.  

### Dataset analysis

The dataset selected for the project is the UCI [Estimation of Obesity Levels Data Set](https://archive.ics.uci.edu/ml/datasets/Estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition+).   
The dataset includes data for the estimation of obesity levels in individuals from Mexico, Peru, and Colombia, based on their eating habits and physical condition.  
The data contains 17 attributes and 2,111 records. Each record is labeled with the class variable NObeyesdad (obesity level, referred to as NObesity in the dataset description), which makes the data suitable for classification.  

The attributes are:
1. Gender: (categorical: Female, Male)
2. Age: (numerical)
3. Height: (numerical)
4. Weight: (numerical)
5. family_history_with_overweight: (categorical: yes, no)
6. FAVC: frequent consumption of high caloric food (categorical: yes, no)
7. FCVC: frequency of consumption of vegetables (numerical)
8. NCP: number of main meals (numerical)
9. CAEC: consumption of food between meals (categorical: Always, Frequently, no, Sometimes)
10. SMOKE: whether the person smokes (categorical: yes, no)
11. CH2O: daily consumption of water (numerical)
12. SCC: calories consumption monitoring (categorical: yes, no)
13. FAF: physical activity frequency (numerical)
14. TUE: time using technology devices (numerical)
15. CALC: consumption of alcohol (categorical: Always, Frequently, no, Sometimes)
16. MTRANS: transportation used (categorical: Automobile, Bike, Motorbike, Public_Transportation, Walking)

The desired target is:  
17. NObeyesdad (categorical: Insufficient_Weight, Normal_Weight, Overweight_Level_I, Overweight_Level_II, Obesity_Type_I, Obesity_Type_II, Obesity_Type_III).  

In the [01. Exploratory Data Analysis notebook](01.%20Exploratory%20Data%20Analysis.ipynb), several analyses were conducted on the dataset used in this project.  
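
Because the target has seven classes, it helps to check how evenly they are represented before settling on an evaluation metric. The following is a minimal sketch of such a check. The helper `class_balance` and the toy labels are hypothetical and only illustrate the idea; they are not the real dataset counts.

```python
from collections import Counter


def class_balance(labels):
    """Return per-class shares and the ratio of the largest to the smallest class."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return shares, imbalance_ratio


# Toy labels for illustration only; these are NOT the real dataset counts.
toy_labels = (
    ["Normal_Weight"] * 4
    + ["Obesity_Type_I"] * 5
    + ["Overweight_Level_I"] * 3
)

shares, imbalance_ratio = class_balance(toy_labels)
print(shares)
print(round(imbalance_ratio, 2))
```

Once the data is available as a pandas DataFrame, the `NObeyesdad` column can be passed to the same helper. A ratio close to 1 indicates a roughly balanced target.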

TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [2]:
# Create dataset folder if it does not already exist
os.makedirs('../dataset', exist_ok=True)

In [3]:
# Download dataset
!wget -O ObesityDataSet.zip  'https://archive.ics.uci.edu/ml/machine-learning-databases/00544/ObesityDataSet_raw_and_data_sinthetic%20(2).zip'

--2021-01-17 02:48:47--  https://archive.ics.uci.edu/ml/machine-learning-databases/00544/ObesityDataSet_raw_and_data_sinthetic%20(2).zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 119205 (116K) [application/x-httpd-php]
Saving to: ‘ObesityDataSet.zip’


2021-01-17 02:48:47 (654 KB/s) - ‘ObesityDataSet.zip’ saved [119205/119205]



In [4]:
# Unzip dataset
!unzip ObesityDataSet.zip

Archive:  ObesityDataSet.zip
  inflating: ObesityDataSet_raw_and_data_sinthetic.arff  
  inflating: ObesityDataSet_raw_and_data_sinthetic.csv  


In [5]:
# Move dataset to folder
!mv ObesityDataSet* ../dataset/
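
The `unzip` and `mv` shell commands above depend on the operating system of the notebook host. As a portable alternative, the extraction step can be sketched with the Python standard library. The helper name `extract_members` and the demo file names are hypothetical; the demonstration builds a throwaway archive so that it runs without network access.

```python
import os
import tempfile
import zipfile


def extract_members(zip_path, dest_dir, suffix=".csv"):
    """Extract archive members ending in `suffix` into dest_dir; return their paths."""
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(suffix):
                # ZipFile.extract returns the path of the file it created
                extracted.append(archive.extract(name, dest_dir))
    return extracted


# Self-contained demonstration with a throwaway archive (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("demo.csv", "a,b\n1,2\n")
        archive.writestr("demo.arff", "@relation demo\n")
    paths = extract_members(demo_zip, os.path.join(tmp, "dataset"))
    extracted_names = [os.path.basename(p) for p in paths]
    print(extracted_names)
```

In this notebook the equivalent call would be `extract_members('ObesityDataSet.zip', '../dataset')`, which places the CSV file directly in the dataset folder so that no separate move step is needed.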

### Initialize workspace and experiment

In [3]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'capstone-project'

experiment=Experiment(ws, experiment_name)

In [4]:
# Print workspace settings
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

Workspace name: udacity-ws
Azure region: eastus
Subscription id: 4ee8335e-198c-4d15-b7f8-70b9f3a46669
Resource group: udacity-rg


### Load dataset

In [5]:
# Get AzureBlob data store
datastore = Datastore.get(ws, 'workspaceblobstore')

# Upload files to data store
datastore.upload_files(files = ['../dataset/ObesityDataSet_raw_and_data_sinthetic.csv'],
                       target_path = 'capstone-dataset/',
                       overwrite = True,
                       show_progress = True)

# Create tabular dataset
dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, 'capstone-dataset/ObesityDataSet_raw_and_data_sinthetic.csv')])

Uploading an estimated of 1 files
Uploading ../dataset/ObesityDataSet_raw_and_data_sinthetic.csv
Uploaded ../dataset/ObesityDataSet_raw_and_data_sinthetic.csv, 1 files out of an estimated total of 1
Uploaded 1 files


In [6]:
# Validate dataset load
dataset.take(3).to_pandas_dataframe()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Female,21,1.62,64.0,yes,no,2,3,Sometimes,no,2,no,0,1,no,Public_Transportation,Normal_Weight
1,Female,21,1.52,56.0,yes,no,3,3,Sometimes,yes,3,yes,3,0,Sometimes,Public_Transportation,Normal_Weight
2,Male,23,1.8,77.0,yes,no,2,3,Sometimes,no,2,no,2,1,Frequently,Public_Transportation,Normal_Weight


### Create / attach cluster

In [7]:
cluster_name = 'capstone-cluster'

# Reuse the cluster if it already exists; otherwise create it
try:
    cluster_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    cluster_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cluster_target = ComputeTarget.create(ws, cluster_name, cluster_config)

# Wait up to 10 minutes for at least one node to become available
cluster_target.wait_for_completion(show_output=True, min_node_count = 1, timeout_in_minutes = 10)

# Get cluster status
print(cluster_target.get_status().serialize())

Found existing cluster, use it.
Succeeded.................................................................................................................
AmlCompute wait for completion finished

Wait timeout has been reached
Current provisioning state of AmlCompute is "Succeeded" and current node count is "0"
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-01-20T18:52:40.181000+00:00', 'errors': None, 'creationTime': '2021-01-12T03:01:15.044380+00:00', 'modifiedTime': '2021-01-12T03:01:31.649682+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_D2_V2'}


## AutoML Configuration

TODO: Explain why you chose the automl settings and configuration you used below.

### Auto ML Settings

* Experiment timeout.  
This is the maximum amount of time in minutes that all iterations combined can take before the experiment terminates.  
If not specified, the default experiment timeout is 6 days.  
It was set to 20 minutes, so the experiment would not run for a long time and consume too many resources.  

* Maximum concurrent iterations.  
This represents the maximum number of iterations that would be executed in parallel.  
The default value is 1.  
AmlCompute clusters support one iteration running per node. For multiple AutoML experiment parent runs executed in parallel on a single AmlCompute cluster, the sum of the values for all experiments should be less than or equal to the maximum number of nodes. Otherwise, runs will be queued until nodes are available.  
It was set to 4, since the cluster was configured to have 4 nodes.  

* Primary Metric.  
To evaluate the performance of the models, *accuracy* was selected.  
Accuracy is a popular choice because it is very easy to understand and explain.  
In this dataset, no class is more critical to identify than the others, so there is no particular need to favor recall (sensitivity), precision, or specificity, and the class distribution is not uneven, so F1 is not required.  
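
The node-budget rule described under *Maximum concurrent iterations* can be sketched as a small check. The function name `fits_node_budget` is hypothetical; it only restates the rule that the summed `max_concurrent_iterations` of all experiments sharing a cluster should not exceed the cluster's maximum node count.

```python
def fits_node_budget(concurrency_per_experiment, max_nodes):
    """True when the summed max_concurrent_iterations stays within the node count."""
    return sum(concurrency_per_experiment) <= max_nodes


# One experiment using all 4 nodes of the cluster fits.
single_experiment_fits = fits_node_budget([4], max_nodes=4)
print(single_experiment_fits)

# Adding a second experiment on the same cluster exceeds the budget,
# so some of its iterations would be queued until nodes are free.
two_experiments_fit = fits_node_budget([4, 2], max_nodes=4)
print(two_experiments_fit)
```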

References:  
[AutoMLConfig Class](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig?view=azure-ml-py).  
[How to select Performance Metrics for Classification Models](https://medium.com/analytics-vidhya/how-to-select-performance-metrics-for-classification-models-c847fe6b1ea3).  


### Auto ML Config

* Task.  
It was set to *classification*, since the problem to solve is to determine the class of Obesity level based on the attributes provided.  
* Enable ONNX Compatible Models.  
It was set to True so that the best model could be retrieved and saved in ONNX format.  
* Featurization.  
It was set to auto, to enable the featurization step, since the dataset is not preprocessed.  

In [8]:
# TODO: Put your automl settings here
automl_settings = {
    'experiment_timeout_minutes': 20,
    'max_concurrent_iterations': 4,
    'primary_metric' : 'accuracy'
}

# TODO: Put your automl config here
automl_config = AutoMLConfig(compute_target=cluster_target,
                             task = 'classification',
                             training_data=dataset,
                             label_column_name='NObeyesdad',   
                             enable_onnx_compatible_models=True,
                             featurization= 'auto',
                             debug_log = 'automl_errors.log',
                             **automl_settings
                            )

In [9]:
# TODO: Submit your experiment
remote_run = experiment.submit(automl_config)

Running on remote.


## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [10]:
RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

![RunDetails 1](../images/AutoMLRun1.png)
![RunDetails 2](../images/AutoMLRun2.png)

In [11]:
# Wait for the remote run to complete
remote_run.wait_for_completion()

{'runId': 'AutoML_ae447fc9-77cd-40a3-b737-aff7e6593944',
 'target': 'capstone-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-01-20T17:54:36.695618Z',
 'endTimeUtc': '2021-01-20T18:36:28.069276Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': None,
  'target': 'capstone-cluster',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"4885dbd2-b4b9-4d06-a439-c81487c5623c\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"capstone-dataset/ObesityDataSet_raw_and_data_sinthetic.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"udacity-rg\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"4ee8335e-198c-4d15

![AutoML RunId](../images/AutoMLRunId1.png)
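
The dictionary returned by `wait_for_completion()` includes `startTimeUtc` and `endTimeUtc`. The sketch below shows one way to turn them into a wall-clock duration. It assumes that both timestamps always carry fractional seconds and a trailing `Z`, as they do in the output above; the helper name `run_duration` is hypothetical.

```python
from datetime import datetime

# Format of the timestamps shown in the run details above (assumed to be stable).
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def run_duration(details):
    """Return the elapsed time between startTimeUtc and endTimeUtc as a timedelta."""
    start = datetime.strptime(details["startTimeUtc"], TIMESTAMP_FORMAT)
    end = datetime.strptime(details["endTimeUtc"], TIMESTAMP_FORMAT)
    return end - start


# Values copied from the run details printed above.
details = {
    "startTimeUtc": "2021-01-20T17:54:36.695618Z",
    "endTimeUtc": "2021-01-20T18:36:28.069276Z",
}

elapsed = run_duration(details)
print(elapsed)
```

For the run above this gives `0:41:51.373658`, roughly 42 minutes of wall-clock time for the whole parent run.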

## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [11]:
# Get model explainability
model_explainability_run_id = remote_run.id + "_" + "ModelExplain"
print(model_explainability_run_id)
model_explainability_run = Run(experiment=experiment, run_id=model_explainability_run_id)
model_explainability_run.wait_for_completion()

AutoML_678d9c7a-208b-4cf5-a04d-fc5649e069bc_ModelExplain


{'runId': 'AutoML_678d9c7a-208b-4cf5-a04d-fc5649e069bc_ModelExplain',
 'target': 'capstone-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-01-24T22:54:31.586908Z',
 'endTimeUtc': '2021-01-24T23:05:54.910162Z',
 'properties': {'azureml.runsource': 'automl',
  'parentRunId': 'AutoML_678d9c7a-208b-4cf5-a04d-fc5649e069bc_50',
  '_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '2d622f5e-668a-4801-a235-efb5eabe493b',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json',
  'dependencies_versions': '{"azureml-train-automl-runtime": "1.20.0.post1", "azureml-train-automl-client": "1.20.0", "azureml-telemetry": "1.20.0", "azureml-pipeline-core": "1.20.0", "azureml-model-management-sdk": "1.0.1b6.post1", "azureml-interpret": "1.20.0", "azureml-defaults": "1.20.0", "azureml-dataset-runtime": "1.20.0", "azureml-dataprep": "2.7.3", "azureml-dataprep-rslex": "1.5.0", "azureml-dataprep-native": "27.0.0", "azureml-cor

In [12]:
# Get the best run and its fitted model
best_run_ml, fitted_model_ml = remote_run.get_output()

In [13]:
print(fitted_model_ml)

Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                               objective='multi:softprob',
                                                                                               random_state=0,
                                                                                               reg_alpha=0,
                                                  

![AutoML Best Model 1](../images/AutoMLRunChildId1.png)

In [14]:
#TODO: Save the best model
# Retrieve best ONNX model
best_run_onnx, onnx_mdl = remote_run.get_output(return_onnx_model=True)

In [15]:
# Save best ONNX model
onnx_fl_path = "./model_ml.onnx"
OnnxConverter.save_onnx_model(onnx_mdl, onnx_fl_path)

In [16]:
# Create model folder
os.makedirs('./model_ml', exist_ok=True)

In [17]:
# Move model to folder
!mv model_ml.onnx ./model_ml/

1 file(s) moved.


## Model Deployment

Remember you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

### Model Accuracy Comparison

#### Hyperparameter Tuning Model
![Hyperparameter Best Model 2](../images/HyperparameterRunChildId2.png)

#### AutoML Model
![AutoML Best Model 3](../images/AutoMLRunChildId1.png)

Because the AutoML-generated model achieved the higher accuracy, it was the model chosen for deployment.

In [18]:
# Download the scoring script generated by AutoML for the best run
script_file_name = 'inference/score.py'
os.makedirs('inference', exist_ok=True)
best_run_ml.download_file('outputs/scoring_file_v_1_0_0.py', script_file_name)

In [19]:
# Register model
registered_model = remote_run.register_model(model_name = 'capstone-model')

![Registered Model](../images/RegisteredModel.png)

In [20]:
# Get registered model path
model_path = Model.get_model_path(model_name = 'capstone-model', _workspace = ws)

In [21]:
# Create inference config
inference_config = InferenceConfig(entry_script=script_file_name)

In [22]:
# Deploy web service
aciconfig = AciWebservice.deploy_configuration(cpu_cores = 1, 
                                               memory_gb = 1)

aci_service_name = 'capstone-service'
print(aci_service_name)

aci_service = Model.deploy(ws, aci_service_name, [registered_model], inference_config, aciconfig)
aci_service.wait_for_deployment(True)
print(aci_service.state)

capstone-service
Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running.....................................
Succeeded
ACI service creation operation finished, operation "Succeeded"
Healthy


![Endpoint](../images/Endpoint.png)

TODO: In the cell below, send a request to the web service you deployed to test it.

In [23]:
# Prepare sample records to score
data = {"data":
        [
          {
            "Gender": "Female",
            "Age": 21,
            "Height": 1.62,
            "Weight": 64,
            "family_history_with_overweight": "yes",
            "FAVC": "no",
            "FCVC": 2,
            "NCP": 3,
            "CAEC": "Sometimes",
            "SMOKE": "no",
            "CH2O": 2,
            "SCC": "no",
            "FAF": 0,
            "TUE": 1,
            "CALC": "no",
            "MTRANS": "Public_Transportation"
          },
          {
            "Gender": "Male",
            "Age": 27,
            "Height": 1.8,
            "Weight": 87,
            "family_history_with_overweight": "no",
            "FAVC": "no",
            "FCVC": 3,
            "NCP": 3,
            "CAEC": "Sometimes",
            "SMOKE": "no",
            "CH2O": 2,
            "SCC": "no",
            "FAF": 2,
            "TUE": 0,
            "CALC": "Frequently",
            "MTRANS": "Walking"
          },
      ]
    }
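
Before sending records to the endpoint, each one can be checked against the attribute list given in the Dataset analysis section. The following is a minimal sketch under that assumption. The names `CATEGORICAL_VALUES`, `NUMERICAL_FIELDS`, and `validate_record` are hypothetical, and the deployed service itself may apply different or additional checks.

```python
# Allowed values per categorical attribute, taken from the attribute list above.
CATEGORICAL_VALUES = {
    "Gender": {"Female", "Male"},
    "family_history_with_overweight": {"yes", "no"},
    "FAVC": {"yes", "no"},
    "CAEC": {"Always", "Frequently", "no", "Sometimes"},
    "SMOKE": {"yes", "no"},
    "SCC": {"yes", "no"},
    "CALC": {"Always", "Frequently", "no", "Sometimes"},
    "MTRANS": {"Automobile", "Bike", "Motorbike", "Public_Transportation", "Walking"},
}
NUMERICAL_FIELDS = {"Age", "Height", "Weight", "FCVC", "NCP", "CH2O", "FAF", "TUE"}


def validate_record(record):
    """Return a list of problems found in one scoring record (empty if none)."""
    problems = []
    expected = set(CATEGORICAL_VALUES) | NUMERICAL_FIELDS
    for field in sorted(expected - set(record)):
        problems.append(f"missing field: {field}")
    for field in sorted(set(record) - expected):
        problems.append(f"unexpected field: {field}")
    for field in sorted(CATEGORICAL_VALUES):
        if field in record and record[field] not in CATEGORICAL_VALUES[field]:
            problems.append(f"invalid value for {field}: {record[field]!r}")
    for field in sorted(NUMERICAL_FIELDS):
        if field in record:
            value = record[field]
            # bool is a subclass of int, so it is rejected explicitly
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"non-numeric value for {field}: {value!r}")
    return problems


# First sample record from the cell above.
sample = {
    "Gender": "Female", "Age": 21, "Height": 1.62, "Weight": 64,
    "family_history_with_overweight": "yes", "FAVC": "no", "FCVC": 2,
    "NCP": 3, "CAEC": "Sometimes", "SMOKE": "no", "CH2O": 2, "SCC": "no",
    "FAF": 0, "TUE": 1, "CALC": "no", "MTRANS": "Public_Transportation",
}
sample_problems = validate_record(sample)
print(sample_problems)

# A deliberately broken copy: unknown transport value and a missing field.
bad = dict(sample, MTRANS="Train")
del bad["Age"]
bad_problems = validate_record(bad)
print(bad_problems)
```

Running a check like `[validate_record(r) for r in data["data"]]` before the request makes malformed payloads fail locally instead of at the endpoint.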

In [24]:
# Convert to JSON string
input_data = json.dumps(data)
with open("data.json", "w") as _f:
    _f.write(input_data)

In [25]:
# Set the content type
headers = {'Content-Type': 'application/json'}

In [26]:
# Make the request and display the response
resp = requests.post(aci_service.scoring_uri, input_data, headers=headers)
print(resp.json())

{"result": ["Normal_Weight", "Overweight_Level_I"]}
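
The printed response uses double quotes, which suggests (this is an assumption, not something the output confirms) that `resp.json()` returned a JSON-encoded string rather than a dictionary. A small helper can handle both cases when extracting the predictions. The name `parse_predictions` is hypothetical.

```python
import json


def parse_predictions(payload):
    """Return the list under 'result', decoding once more if payload is a JSON string."""
    if isinstance(payload, str):
        payload = json.loads(payload)
    return payload["result"]


# The same response body in both possible shapes.
as_string = '{"result": ["Normal_Weight", "Overweight_Level_I"]}'
as_dict = {"result": ["Normal_Weight", "Overweight_Level_I"]}

predictions_from_string = parse_predictions(as_string)
predictions_from_dict = parse_predictions(as_dict)
print(predictions_from_string)
print(predictions_from_dict)
```

With the request above, the call would be `parse_predictions(resp.json())`, which yields one predicted class per submitted record.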


TODO: In the cell below, print the logs of the web service and delete the service

In [27]:
# Get logs
aci_service.get_logs()



In [28]:
# Delete service
aci_service.delete()

![Endpoint Deletion](../images/EndpointDeletion.png)