## Automated ML

This project is part of the Udacity Azure ML Nanodegree.

In [1]:
import azureml.core

print("SDK version:", azureml.core.VERSION)

SDK version: 1.26.0


### Overview
Please refer to the GitHub README file for a comprehensive overview of the project, including all details regarding the dataset.

As this is a Mercedes-Benz used car price prediction project, I will run an Azure AutoML regression experiment to find the best model for price prediction.

Steps in this notebook include:
- Experiment
- Compute
- Dataset
- AutoML Configuration
- Run Details
- Best Model
- Model Deployment

### Experiment

Creates the experiment called 'mercedes-price-prediction'.

In [2]:
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace

ws = Workspace.from_config()

experiment_name = 'mercedes-price-prediction'

experiment = Experiment(ws, experiment_name)

print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')
print('\n')
print(experiment)

udacity-ws
udacity-rg
westeurope
939d1c66-7864-4f15-8560-5c793c4110c8


Experiment(Name: mercedes-price-prediction,
Workspace: udacity-ws)


### Compute

Uses the existing compute cluster if one is found.

If it does not exist, it is created instead.

In [3]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "mercedes-cc"

# Reuse the cluster if it already exists, otherwise create it
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                              max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


### Dataset

Loads the already registered dataset called 'mercedes'.

Prints out if dataset was found or not.

In [4]:
key = "mercedes"

# Look up the registered dataset by name in the workspace
if key in ws.datasets:
    dataset = ws.datasets[key]
    print("dataset found")
else:
    print("dataset not found")

dataset found


Prints an overview of the dataset to check its quality. For instance, each numeric column contains 13,119 data points. The price ranges from 650 to 159,999 British pounds, which sounds reasonable for Mercedes-Benz cars given the average year of registration (2017).

In [5]:
df = dataset.to_pandas_dataframe()

df.describe()

| | year | price | mileage | tax | mpg | engineSize |
|---|---|---|---|---|---|---|
| count | 13119.0 | 13119.0 | 13119.0 | 13119.0 | 13119.0 | 13119.0 |
| mean | 2017.296288 | 24698.59692 | 21949.559037 | 129.972178 | 55.155843 | 2.07153 |
| std | 2.224709 | 11842.675542 | 21176.512267 | 65.260286 | 15.220082 | 0.572426 |
| min | 1970.0 | 650.0 | 1.0 | 0.0 | 1.1 | 0.0 |
| 25% | 2016.0 | 17450.0 | 6097.5 | 125.0 | 45.6 | 1.8 |
| 50% | 2018.0 | 22480.0 | 15189.0 | 145.0 | 56.5 | 2.0 |
| 75% | 2019.0 | 28980.0 | 31779.5 | 145.0 | 64.2 | 2.1 |
| max | 2020.0 | 159999.0 | 259000.0 | 580.0 | 217.3 | 6.2 |


Prints out the first 5 rows of the dataset.

In [6]:
df.head(5)

| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|---|
| 0 | SLK | 2005 | 5200 | Automatic | 63000 | Petrol | 325 | 32.1 | 1.8 |
| 1 | S Class | 2017 | 34948 | Automatic | 27000 | Hybrid | 20 | 61.4 | 2.1 |
| 2 | SL CLASS | 2016 | 49948 | Automatic | 6200 | Petrol | 555 | 28.0 | 5.5 |
| 3 | G Class | 2016 | 61948 | Automatic | 16000 | Petrol | 325 | 30.4 | 4.0 |
| 4 | G Class | 2016 | 73948 | Automatic | 4000 | Petrol | 325 | 30.1 | 4.0 |
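
The plausibility checks described in this section can also be made explicit in code. The following is a minimal sketch using only the standard library on the five rows printed by `df.head(5)`. The helper name `check_rows` and its default bounds (taken from the minimum and maximum price reported by `df.describe()`) are assumptions for illustration and not part of the original notebook.

```python
import csv
import io

# The five rows printed by df.head(5), copied as CSV (index column omitted).
SAMPLE = """model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize
SLK,2005,5200,Automatic,63000,Petrol,325,32.1,1.8
S Class,2017,34948,Automatic,27000,Hybrid,20,61.4,2.1
SL CLASS,2016,49948,Automatic,6200,Petrol,555,28.0,5.5
G Class,2016,61948,Automatic,16000,Petrol,325,30.4,4.0
G Class,2016,73948,Automatic,4000,Petrol,325,30.1,4.0
"""


def check_rows(rows, price_min=650, price_max=159999):
    """Return a list of human-readable problems found in the rows (hypothetical helper)."""
    problems = []
    for i, row in enumerate(rows):
        # Missing values show up as empty strings when reading CSV.
        empty = [name for name, value in row.items() if value == ""]
        if empty:
            problems.append(f"row {i}: missing {empty}")
        # Flag prices outside the range reported by df.describe().
        if row.get("price"):
            price = float(row["price"])
            if not price_min <= price <= price_max:
                problems.append(f"row {i}: price {price} outside [{price_min}, {price_max}]")
    return problems


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
problems = check_rows(rows)
print(problems or "no problems found")
```

On the full dataset the same idea could be expressed directly on the pandas DataFrame; the standard-library version is used here only to keep the sketch self-contained.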


### AutoML Configuration
Considering the overall high quality of the dataset, I chose to time out the experiment after 15 minutes and to enable early stopping. Since the price column is numeric, it makes sense to use an easy-to-understand metric such as the (normalized) mean absolute error as the primary metric.

As I want to predict numerical values (prices in British pounds), regression is the appropriate task.

In [7]:
from azureml.train.automl import AutoMLConfig

# automl settings
automl_settings = {
    "experiment_timeout_minutes": 15,
    "max_concurrent_iterations": 5,
    "primary_metric": "normalized_mean_absolute_error",
    "featurization": 'auto',
    "enable_early_stopping": True,
}

# automl config
automl_regressor = AutoMLConfig(
    compute_target=cpu_cluster,
    task="regression",
    training_data=dataset,
    label_column_name="price",
    **automl_settings
    )
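
To make the primary metric more concrete, here is a minimal sketch of how a normalized mean absolute error can be computed. It assumes the metric is the mean absolute error divided by the range of the true label values, which is how I understand the Azure ML description of `normalized_mean_absolute_error`. The true prices are the five rows shown in the Dataset section; the predictions are made up for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def normalized_mae(y_true, y_pred):
    """MAE divided by the range of the true values (assumed normalization)."""
    label_range = max(y_true) - min(y_true)
    return mean_absolute_error(y_true, y_pred) / label_range


# True prices from the first five rows of the dataset.
y_true = [5200, 34948, 49948, 61948, 73948]
# Hypothetical predictions, invented for this example.
y_pred = [6200, 33948, 51948, 60948, 72948]

mae = mean_absolute_error(y_true, y_pred)
nmae = normalized_mae(y_true, y_pred)
print("MAE:", mae)
print("Normalized MAE:", nmae)
```

Because the normalized value is scale-free, lower is better regardless of the currency or price level, while the plain MAE stays interpretable in British pounds.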

### Run Details

Submits the experiment and shows the progress of the individual iterations in the `RunDetails` widget.

In [8]:
from azureml.widgets import RunDetails

# Submit experiment
automl_run = experiment.submit(automl_regressor, show_output=True)

RunDetails(automl_run).show()
automl_run.wait_for_completion(show_output=True)

Submitting remote run.
No run_configuration provided, running on mercedes-cc with default configuration
Running on remote compute: mercedes-cc


Experiment,Id,Type,Status,Details Page,Docs Page
mercedes-price-prediction,AutoML_b88cfcdf-7f5e-447d-8a2f-0f268afeff64,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation



Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Cross validation
STATUS:       DONE
DESCRIPTION:  Each iteration of the trained model was validated through cross-validation.
              
DETAILS:      
+---------------------------------+
|Number of folds                  |
|3                                |
+---------------------------------+

****************************************************************************************************

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization

****************************************************************************************************

TYPE:         High cardinality feature detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and n

              Learn more about high cardinality feature handling: https://aka.ms/AutomatedMLFeaturization

****************************************************************************************************

****************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
****************************************************************************************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   MaxAbsScaler LightGBM                          0:02:28       0.0113    0.0113
         1   MaxAbsScaler XGBoostRegressor                  0:00:45       0.0151    0.0113
         2   StandardScalerWrapper DecisionTree           

{'runId': 'AutoML_b88cfcdf-7f5e-447d-8a2f-0f268afeff64',
 'target': 'mercedes-cc',
 'status': 'Completed',
 'startTimeUtc': '2021-04-20T12:00:07.116127Z',
 'endTimeUtc': '2021-04-20T12:17:49.838759Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'normalized_mean_absolute_error',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': None,
  'target': 'mercedes-cc',
  'DataPrepJsonString': '{\\"training_data\\": {\\"datasetId\\": \\"f3d238d6-2159-4501-b5bf-6a2d75e77de5\\"}, \\"datasets\\": 0}',
  'EnableSubsampling': None,
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'regression',
  'dependencies_versions': '{"azureml-widgets": "1.26.0", "azureml-train": "1.26.0", "azureml-train-restclients-hyperdrive": "1.26.0", "azureml-train-core": "1.26.0", "azureml-train-automl": "1.26.0", "azureml-train-automl-runtime": "1.26.0", "azureml-train-automl-clie

### Best Model

Gets the best run and its fitted model from the AutoML experiment and prints both.

In [9]:
best_run, best_model = automl_run.get_output()
print('\n')
print(best_run)
print('\n')
print(best_model)



Run(Experiment: mercedes-price-prediction,
Id: AutoML_b88cfcdf-7f5e-447d-8a2f-0f268afeff64_0,
Type: azureml.scriptrun,
Status: Completed)


RegressionPipeline(pipeline=Pipeline(memory=None,
                                     steps=[('datatransformer',
                                             DataTransformer(enable_dnn=None,
                                                             enable_feature_sweeping=None,
                                                             feature_sweeping_config=None,
                                                             feature_sweeping_timeout=None,
                                                             featurization_config=None,
                                                             force_text_dnn=None,
                                                             is_cross_validation=None,
                                                             is_onnx_compatible=None,
                                                 

Retrieves all metrics logged for the AutoML run.

In [10]:
automl_run_metrics = automl_run.get_metrics()
automl_run_metrics

{'experiment_status': ['DatasetEvaluation',
  'FeaturesGeneration',
  'DatasetFeaturization',
  'DatasetFeaturizationCompleted',
  'DatasetCrossValidationSplit',
  'ModelSelection',
  'BestRunExplainModel',
  'ModelExplanationDataSetSetup',
  'PickSurrogateModel',
  'EngineeredFeatureExplanations',
  'EngineeredFeatureExplanations',
  'RawFeaturesExplanations',
  'RawFeaturesExplanations',
  'BestRunExplainModel'],
 'experiment_status_description': ['Gathering dataset statistics.',
  'Generating features for the dataset.',
  'Beginning to fit featurizers and featurize the dataset.',
  'Completed fit featurizers and featurizing the dataset.',
  'Generating individually featurized CV splits.',
  'Beginning model selection.',
  'Best run model explanations started',
  'Model explanations data setup completed',
  'Choosing LightGBM as the surrogate model for explanations',
  'Computation of engineered features started',
  'Computation of engineered features completed',
  'Computation of ra

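Prints the run ID and the mean absolute error of the best model. Note that `automl_run.id` is the ID of the parent AutoML run; the best child run printed above has the same ID with the suffix `_0`.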

In [11]:
print('Best Run ID:', automl_run.id)
print('Mean Absolute Error:', automl_run_metrics['mean_absolute_error'])

Best Run ID: AutoML_b88cfcdf-7f5e-447d-8a2f-0f268afeff64
Mean Absolute Error: 1801.3507542906725
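
As a quick cross-check, the reported mean absolute error can be related to the price statistics from the Dataset section. This is a rough sketch: it assumes the normalization uses the full price range from `df.describe()`, whereas AutoML evaluates on cross-validation folds, so the agreement is only approximate.

```python
# Values copied from the notebook outputs.
mae = 1801.3507542906725                  # reported mean absolute error
price_min, price_max = 650.0, 159999.0    # min and max price from df.describe()
price_mean = 24698.59692                  # mean price from df.describe()

# MAE relative to the price range (assumed normalization).
nmae = mae / (price_max - price_min)
# MAE relative to the average price, as an intuitive percentage.
relative_to_mean = mae / price_mean

print(round(nmae, 4))             # 0.0113
print(f"{relative_to_mean:.1%}")  # 7.3%
```

The first value is in line with the 0.0113 reported for iteration 0 in the run log, and the second suggests the model is off by roughly 7% of the average car price.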


Saves the best model as a pickle file (`model.pkl`).

In [12]:
#Save the best model

import joblib  
joblib.dump(best_model, "model.pkl")

['model.pkl']

### Model Deployment

Registers the best model in the workspace.

In [13]:
model_name = best_run.properties['model_name']
model_name

'AutoMLb88cfcdf70'

In [14]:
model = automl_run.register_model(model_name=model_name)

print(model.name, model.id, model.version)

AutoMLb88cfcdf70 AutoMLb88cfcdf70:1 1


Creates an inference config and deploys the model as a web service.

In [16]:
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice
from azureml.core.environment import Environment
from azureml.core.model import Model

service_name = 'mercedes-predictor'
env = best_run.get_environment()
env.save_to_directory(path="Users/info.cz")

script_file = 'score.py'
best_run.download_file('outputs/scoring_file_v_1_0_0.py', script_file)

inference_config = InferenceConfig(entry_script=script_file, environment=env)
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(workspace=ws,
                       name=service_name,
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=aci_config)

service.wait_for_deployment(show_output=True)

print("State: ",service.state)
print("S-URI: ",service.scoring_uri)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2021-04-20 12:23:17+00:00 Creating Container Registry if not exists..
2021-04-20 12:23:33+00:00 Registering the environment..
2021-04-20 12:23:34+00:00 Use the existing image.
2021-04-20 12:23:34+00:00 Generating deployment configuration.
2021-04-20 12:23:35+00:00 Submitting deployment to compute..
2021-04-20 12:23:41+00:00 Checking the status of deployment mercedes-predictor..
2021-04-20 12:26:49+00:00 Checking the status of inference endpoint mercedes-predictor.
Succeeded
ACI service creation operation finished, operation "Succeeded"
State:  Healthy
S-URI:  http://a56d1b11-1bde-4589-bf40-cd1683522c97.westeurope.azurecontainer.io/score


Sends a request to the deployed web service to test it.

In [40]:
import requests
import json

# URL for the web service (the same value as service.scoring_uri printed above)
scoring_uri = 'http://a56d1b11-1bde-4589-bf40-cd1683522c97.westeurope.azurecontainer.io/score'

# Two sets of data to score, so we get two results back
data = {"data":
        [
          {
                "model": "G Class",
                "year": 2016,
                "transmission": "Automatic",
                "mileage": 16000,
                "fuelType": "Petrol",
                "tax": 325,
                "mpg": 30.4,
                "engineSize": 4.0
          },
          {
                "model": "G Class",
                "year": 2016,
                "transmission": "Automatic",
                "mileage": 100000,
                "fuelType": "Petrol",
                "tax": 325,
                "mpg": 30.4,
                "engineSize": 4.0
          },
      ]
    }

# Convert to JSON string
input_data = json.dumps(data)

# Set the content type
headers = {'Content-Type': 'application/json'}

# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
print(resp.text)

"{\"result\": [56265.4312647257, 24857.945435516085]}"

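The response printed above appears to be a JSON string that itself contains JSON, so it has to be decoded twice to get at the predictions. A minimal sketch, using the response text from above as a literal instead of calling the service (the variable names are assumptions for illustration):

```python
import json

# Response body as printed above: a JSON-encoded string wrapping a JSON object.
raw = r'"{\"result\": [56265.4312647257, 24857.945435516085]}"'

# The first pass yields a Python str, the second pass yields the dict.
decoded = json.loads(raw)
if isinstance(decoded, str):
    decoded = json.loads(decoded)

predictions = decoded["result"]
for mileage, price in zip([16000, 100000], predictions):
    print(f"mileage {mileage}: predicted price {price:.2f}")
```

The two predictions correspond to the two request entries, which differ only in mileage: the car with 100,000 miles is priced considerably lower than the one with 16,000 miles.
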

Prints the logs of the web service.

In [41]:
service.get_logs()



Deletes the service.

In [None]:
service.delete()