# Tutorial: Create an automated ML experiment

In Tutorial #1, predict-emailservice-xgboost-part1.ipynb, you trained machine learning models and registered a model in your cloud workspace. This tutorial walks through the steps to [configure automated ML experiments in Python](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train) instead of training your own model. Alternatively, you can [configure automated ML experiments in the studio](https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-first-experiment-automated-ml), which does not require coding experience.
      
        
In this tutorial, you use Azure Machine Learning service to:
* Retrieve the dataset from your workspace
* Create the configuration for an automated ML experiment
* Run the automated ML experiment
* Register the automated ML model

For a more detailed overview of automated ML, see [this review on Medium](https://medium.com/microsoftazure/a-review-of-azure-automated-machine-learning-automl-5d2f98512406).

## Connect Azure Machine Learning Workspace

Create a workspace object from the existing workspace. `Workspace.from_config()` reads the file **config.json** and loads the details into an object named `workspace`.

If you see this message:
"Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code `<token>` to authenticate."

Click the link and enter the `<token>` to authenticate. After authenticating, run this cell again to load the workspace.

In [1]:
# Load workspace configuration from the config.json file in the current folder.
from azureml.core import Workspace
workspace = Workspace.from_config()
# print(workspace.name, workspace.location, workspace.resource_group, workspace.location, sep='\t')
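
For reference, the **config.json** file that `Workspace.from_config()` reads is a small JSON document. The sketch below writes and reads back a file of that shape using only the standard library. The three key names follow the file that the Azure portal offers for download; the values are placeholders, and the temporary folder is used only so the example does not touch a real configuration.

```python
import json
import os
import tempfile

# Placeholder values: substitute your own subscription, resource group, and workspace.
config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Write the file to a temporary folder (an assumption for this demo only).
config_dir = tempfile.mkdtemp()
config_path = os.path.join(config_dir, "config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

# Read it back and confirm that the expected keys are present.
with open(config_path) as f:
    loaded = json.load(f)

expected_keys = ("subscription_id", "resource_group", "workspace_name")
missing = [key for key in expected_keys if key not in loaded]
print("missing keys:", missing)
```

If any key is missing from your own file, downloading a fresh config.json from the workspace overview page is usually the quickest fix.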


### Create or Attach existing AmlCompute
A compute target is required to execute the Automated ML run. In this tutorial, you create AmlCompute as your training compute resource.
Creation of AmlCompute takes approximately 5 minutes.
If an AmlCompute cluster with that name already exists in your workspace, this code skips the creation step. As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service.

In [2]:
import os

from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

# Choose a name for your cluster.
# Environment variables are strings, so cast the node counts to int.
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "cpucluster")
compute_min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0))
compute_max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 2))

# This example uses CPU VM. For using GPU VM, set SKU to STANDARD_NC6
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_DS2_V2")


if compute_name in workspace.compute_targets:
    compute_target = workspace.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('found compute target. just use it. ' + compute_name)
else:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_size,
                                                                min_nodes=compute_min_nodes,
                                                                max_nodes=compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(
        workspace, compute_name, provisioning_config)

    # can poll for a minimum number of nodes and for a specific timeout.
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(
        show_output=True, min_node_count=None, timeout_in_minutes=20)

    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())

# Reference the cluster by the name chosen above ("cpucluster" by default)
aml_compute = workspace.compute_targets[compute_name]

creating a new compute target...
Creating...
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-04-26T01:27:24.864000+00:00', 'errors': None, 'creationTime': '2021-04-26T01:27:22.457607+00:00', 'modifiedTime': '2021-04-26T01:27:37.878100+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 2, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_DS2_V2'}
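
The cluster settings above come from environment variables with fallback defaults. One detail worth illustrating is that `os.environ.get` returns a string whenever the variable is set, so numeric settings need an explicit conversion. The helper `env_int` below is a hypothetical name introduced for this sketch and is not part of the Azure ML SDK.

```python
import os

def env_int(name, default):
    """Return the environment variable `name` as an int, or `default` if it is unset."""
    raw = os.environ.get(name)
    return default if raw is None else int(raw)

# Simulate one override and one unset variable for the demonstration.
os.environ["AML_COMPUTE_CLUSTER_MAX_NODES"] = "4"
os.environ.pop("AML_COMPUTE_CLUSTER_MIN_NODES", None)

min_nodes = env_int("AML_COMPUTE_CLUSTER_MIN_NODES", 0)
max_nodes = env_int("AML_COMPUTE_CLUSTER_MAX_NODES", 2)
print(min_nodes, max_nodes, type(max_nodes).__name__)
```

Converting up front keeps the values passed to `min_nodes` and `max_nodes` the same type regardless of whether they came from the environment or from the defaults.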


## Import Azure Machine Learning SDK for Python 

This step checks that the Azure Machine Learning SDK for Python is installed. Most of the code in this tutorial requires the Azure ML SDK.

Display the Azure Machine Learning SDK version.

In [3]:
import azureml.core

# check core SDK version number (need Python 3.6 kernel if you run this in Microsoft Azure Notebooks)
print("Azure ML SDK Version: ", azureml.core.VERSION)

Azure ML SDK Version:  1.24.0


### How to download data from datastore

This section shows how to download a dataset (tabular) that was created in predict-link-xgboost-part1.ipynb.

[Create Azure Machine Learning datasets](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets)

In [4]:
from azureml.core import Workspace, Dataset

dataset = Dataset.get_by_name(workspace, name='demo_query_dataset_tabular')
train = dataset.to_pandas_dataframe()
train

Unnamed: 0,Column1,@search.features.keyphrases.similarityScore,@search.features.keyphrases.termFrequency,@search.features.keyphrases.uniqueTokenMatches,@search.features.query.similarityScore,@search.features.query.termFrequency,@search.features.query.uniqueTokenMatches,@search.features.url.similarityScore,@search.features.url.termFrequency,@search.features.url.uniqueTokenMatches,@search.score,AzureSearch_DocumentKey,grade,keyphrases,query,sessionid,url
0,0,0.802591,1.0,1.0,0.802591,1.0,1.0,0.941192,2.0,1.0,2.546375,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,7,['powershell'],powershell,1,https://docs.microsoft.com/en-us/powershell/sc...
1,1,0.802591,1.0,1.0,0.802591,1.0,1.0,0.941192,2.0,1.0,2.546375,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,6,['powershell'],powershell,1,https://docs.microsoft.com/en-us/powershell/sc...
2,2,0.802591,1.0,1.0,0.802591,1.0,1.0,0.856795,1.0,1.0,2.461978,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,9,['powershell'],powershell,1,https://docs.microsoft.com/en-us/powershell/
3,3,0.802591,1.0,1.0,0.802591,1.0,1.0,0.80911,2.0,1.0,2.414292,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,5,['powershell'],powershell,1,https://docs.microsoft.com/en-us/powershell/sc...
4,4,0.802591,1.0,1.0,0.802591,1.0,1.0,0.80911,2.0,1.0,2.414292,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,4,['powershell'],powershell,1,https://docs.microsoft.com/en-us/powershell/sc...
5,5,0.802591,1.0,1.0,0.802591,1.0,1.0,0.80911,2.0,1.0,2.414292,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,2,['powershell'],powershell,1,https://docs.microsoft.com/en-us/powershell/sc...
6,6,0.802591,1.0,1.0,0.802591,1.0,1.0,0.780786,1.0,1.0,2.385969,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,10,['powershell'],powershell,1,https://docs.microsoft.com/en-us/powershell/sc...
7,7,0.802591,1.0,1.0,0.802591,1.0,1.0,0.680645,1.0,1.0,2.285828,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,8,['powershell'],powershell,1,https://docs.microsoft.com/en-us/windows-serve...
8,8,0.802591,1.0,1.0,0.802591,1.0,1.0,0.680645,1.0,1.0,2.285828,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,3,['powershell'],powershell,1,https://docs.microsoft.com/en-us/powershell/az...
9,9,0.802591,1.0,1.0,0.802591,1.0,1.0,0.550636,1.0,1.0,2.155819,aHR0cHM6Ly9jc2lkZW1haWxkYXRhLmJsb2IuY29yZS53aW...,1,['powershell'],powershell,1,https://docs.microsoft.com/en-us/virtualizatio...


### Cleaning the dataset for model training
Define the columns to include in the training set. The target column must be listed together with the features because the AutoML configuration takes a single DataFrame that contains both.

In [5]:
features = ['@search.features.keyphrases.similarityScore',
   '@search.features.keyphrases.termFrequency',
   '@search.features.keyphrases.uniqueTokenMatches',
   '@search.features.query.similarityScore',
   '@search.features.query.termFrequency',
   '@search.features.query.uniqueTokenMatches',
   '@search.features.url.similarityScore',
   '@search.features.url.termFrequency',
   '@search.features.url.uniqueTokenMatches', '@search.score', 'grade']

label_column = 'grade'

train[features]

Unnamed: 0,@search.features.keyphrases.similarityScore,@search.features.keyphrases.termFrequency,@search.features.keyphrases.uniqueTokenMatches,@search.features.query.similarityScore,@search.features.query.termFrequency,@search.features.query.uniqueTokenMatches,@search.features.url.similarityScore,@search.features.url.termFrequency,@search.features.url.uniqueTokenMatches,@search.score,grade
0,0.802591,1.0,1.0,0.802591,1.0,1.0,0.941192,2.0,1.0,2.546375,7
1,0.802591,1.0,1.0,0.802591,1.0,1.0,0.941192,2.0,1.0,2.546375,6
2,0.802591,1.0,1.0,0.802591,1.0,1.0,0.856795,1.0,1.0,2.461978,9
3,0.802591,1.0,1.0,0.802591,1.0,1.0,0.80911,2.0,1.0,2.414292,5
4,0.802591,1.0,1.0,0.802591,1.0,1.0,0.80911,2.0,1.0,2.414292,4
5,0.802591,1.0,1.0,0.802591,1.0,1.0,0.80911,2.0,1.0,2.414292,2
6,0.802591,1.0,1.0,0.802591,1.0,1.0,0.780786,1.0,1.0,2.385969,10
7,0.802591,1.0,1.0,0.802591,1.0,1.0,0.680645,1.0,1.0,2.285828,8
8,0.802591,1.0,1.0,0.802591,1.0,1.0,0.680645,1.0,1.0,2.285828,3
9,0.802591,1.0,1.0,0.802591,1.0,1.0,0.550636,1.0,1.0,2.155819,1
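
Because the label travels with the features during training but must be left out at prediction time, it can help to derive the predictor-only list from the training list rather than maintain two lists by hand. The following is a minimal sketch using a shortened column list; the variable names `training_columns` and `predictor_columns` are introduced here for illustration.

```python
label_column = "grade"

# Shortened version of the column list used for training (label included).
training_columns = [
    "@search.features.url.similarityScore",
    "@search.features.url.termFrequency",
    "@search.score",
    "grade",
]

# Fail early if the label was accidentally dropped from the training columns.
if label_column not in training_columns:
    raise ValueError(f"label column {label_column!r} is missing from the training columns")

# Columns passed to predict(): everything except the label.
predictor_columns = [c for c in training_columns if c != label_column]
print(predictor_columns)
```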


### Configure your Auto ML experiment settings
There are several options that you can use to configure your automated machine learning experiment. These parameters are set by instantiating an AutoMLConfig object. See the [AutoMLConfig class](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig?view=azure-ml-py) for a full list of parameters.

1. Ensemble settings are enabled by default and can be [configured](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#ensemble-configuration) in automated ML runs.

2. The training data must be on the local machine (only for local AutoML runs) or in the cloud (Azure Blob Storage, Azure File Share, Azure Data Lake Storage, Azure SQL Database, Azure PostgreSQL Database and Azure MySQL Database). The data can be read into a Pandas DataFrame (local AutoML only) or an Azure Machine Learning TabularDataset (local and remote AutoML).
Requirements for training data in machine learning:
   + Data must be in tabular form.
   + The value to predict, the target column, must be in the data.


3. `experiment_timeout_minutes` must be at least 15 minutes.

4. Featurization can be set to "auto" to let AutoML run checks on the training dataset and create [features](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-features#automatic-featurization) from it, such as imputing missing data.
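
The constraints listed above can be checked before a run is submitted. The sketch below collects the settings in a plain dictionary and validates two of them; `check_settings` is a hypothetical helper written for this example, and the 15-minute floor is taken from this tutorial rather than queried from the SDK.

```python
MIN_EXPERIMENT_TIMEOUT_MINUTES = 15  # minimum stated in this tutorial

def check_settings(settings):
    """Hypothetical helper: return a list of problems found in a settings dict."""
    problems = []
    timeout = settings.get("experiment_timeout_minutes")
    if timeout is not None and timeout < MIN_EXPERIMENT_TIMEOUT_MINUTES:
        problems.append("experiment_timeout_minutes must be at least 15")
    if not settings.get("label_column_name"):
        problems.append("label_column_name is required")
    return problems

automl_settings = {
    "task": "classification",
    "label_column_name": "grade",
    "experiment_timeout_minutes": 15,
    "primary_metric": "AUC_weighted",
    "featurization": "auto",
    "n_cross_validations": 3,
}

problems = check_settings(automl_settings)
too_short = check_settings({**automl_settings, "experiment_timeout_minutes": 10})
print(problems)
print(too_short)
```

A dictionary like this can be unpacked into the constructor, for example `AutoMLConfig(training_data=..., **automl_settings)`, which keeps the settings easy to log and reuse.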

In [6]:
import logging
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='classification',
                             debug_log='automated_ml_errors.log',
                             training_data= train[features],
                             label_column_name="grade",
                             iteration_timeout_minutes = 10,
                             
                             # Exit criterion: stop the experiment after this many minutes
                             experiment_timeout_minutes = 15,  # alternatively, set experiment_exit_score
                             enable_early_stopping = True,
                             primary_metric = 'AUC_weighted',
                             featurization = 'auto',
                             verbosity = logging.INFO,
                             n_cross_validations = 3,
                             
                             ### Ensemble models are enabled by default, and appear as the final run iterations in an AutoML run. 
#                              enable_voting_ensemble=False,
#                              enable_stack_ensemble=False
                            )

### Create an Experiment

An Experiment tracks the runs in your workspace and is required for automated ML runs. Submit the `automl_config` created above to the experiment to start the automated ML run.

In [7]:
from azureml.core import Experiment

experiment_name = 'predict-link-automl'
exp = Experiment(workspace=workspace, name=experiment_name)

run = exp.submit(automl_config, show_output=True)

# If you need to retrieve a run that already started, use the following code
#from azureml.train.automl.run import AutoMLRun
#remote_run = AutoMLRun(experiment = exp, run_id = '<replace with your run id>')

No run_configuration provided, running on local with default configuration
Running on local machine
Parent Run ID: AutoML_ec312415-7fd6-470d-80d5-0dd0b823b9ed

Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed fit featurizers and featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

**************************************************

Cannot serialize JSON, possibly due to NaN or Inf, scrubbing to zero and retrying...
Cannot serialize numpy arrays as JSON


Current status: RawFeaturesExplanations. Computation of raw features completed
Current status: BestRunExplainModel. Best run model explanations completed
****************************************************************************************************


### Explore models and metrics
Automated ML offers options for you to monitor and evaluate your training results.

You can view the hyperparameters, the scaling and normalization techniques, and algorithm applied to a specific automated ML run with the following custom code solution.
The following defines the custom method, print_model(), which prints the hyperparameters of each step of the automated ML training pipeline.

For a run that was just submitted and trained from within the same experiment notebook, you can pass in the best model using the **get_output()** method.

In [9]:
from pprint import pprint

def print_model(model, prefix=""):
    for step in model.steps:
        print(prefix + step[0])
        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):
            pprint({'estimators': list(e[0] for e in step[1].estimators), 'weights': step[1].weights})
            print()
            for estimator in step[1].estimators:
                print_model(estimator[1], estimator[0]+ ' - ')
        elif hasattr(step[1], '_base_learners') and hasattr(step[1], '_meta_learner'):
            print("\nMeta Learner")
            pprint(step[1]._meta_learner)
            print()
            for estimator in step[1]._base_learners:
                print_model(estimator[1], estimator[0]+ ' - ')
        else:
            pprint(step[1].get_params())
            print()
            
best_run, fitted_model = run.get_output()
print_model(fitted_model)

datatransformer
{'enable_dnn': None,
 'enable_feature_sweeping': None,
 'feature_sweeping_config': None,
 'feature_sweeping_timeout': None,
 'featurization_config': None,
 'force_text_dnn': None,
 'is_cross_validation': None,
 'is_onnx_compatible': None,
 'logger': None,
 'observer': None,
 'task': None,
 'working_dir': None}

prefittedsoftvotingclassifier
{'estimators': ['3', '9', '24', '28', '2', '26', '15', '23'],
 'weights': [0.07142857142857142,
             0.2857142857142857,
             0.07142857142857142,
             0.07142857142857142,
             0.07142857142857142,
             0.07142857142857142,
             0.14285714285714285,
             0.21428571428571427]}

3 - robustscaler
{'copy': True,
 'quantile_range': [10, 90],
 'with_centering': False,
 'with_scaling': False}

3 - extratreesclassifier
{'bootstrap': False,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'entropy',
 'max_depth': None,
 'max_features': 'log2',
 'max_leaf_nodes': None,
 'max_sampl
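
The `prefittedsoftvotingclassifier` step in the output combines several fitted models using the listed weights. As a rough illustration of the general soft-voting idea (not the AutoML implementation itself), the sketch below averages per-model class probabilities by weight and picks the class with the highest combined probability. The probabilities and weights are made up for the example.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-model class probabilities, followed by argmax."""
    n_classes = len(probabilities[0])
    total = sum(weights)
    combined = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    best_class = max(range(n_classes), key=lambda c: combined[c])
    return combined, best_class

# Made-up probabilities from three models over two classes.
probs = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
weights = [0.5, 0.25, 0.25]

combined, best = soft_vote(probs, weights)
print(combined, best)
```

Two of the three models prefer class 1 here, yet the heavier weight on the first model tips the combined vote to class 0, which is why the weights in the printed ensemble matter.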

### Monitor runs using the Jupyter notebook widget
You can watch the progress of a submitted run using the Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

In [10]:
from azureml.widgets import RunDetails
RunDetails(run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

You can also get a link to the same display in your workspace instead of using the Python code above.

In [12]:
print(run.get_portal_url())

### View run metrics

In [13]:
from azureml.core import Run

best_run_metrics = best_run.get_metrics() # or other runs with runID

for metric_name in best_run_metrics:
    metric = best_run_metrics[metric_name]
    print(metric_name, ':', metric)

balanced_accuracy : 0.2833333333333333
precision_score_macro : 0.23833333333333337
accuracy : 0.25633528265107214
AUC_weighted : 0.7721704520014634
average_precision_score_macro : 0.47054281109836665
norm_macro_recall : 0.20833333333333334
precision_score_weighted : 0.278411306042885
f1_score_micro : 0.25633528265107214
recall_score_micro : 0.25633528265107214
f1_score_weighted : 0.23784925276153346
average_precision_score_weighted : 0.4922245098414688
precision_score_micro : 0.25633528265107214
recall_score_weighted : 0.25633528265107214
f1_score_macro : 0.2233333333333333
AUC_macro : 0.7780358223536328
recall_score_macro : 0.2833333333333333
AUC_micro : 0.7116523349381323
average_precision_score_micro : 0.31681756790638854
log_loss : 2.112302938320622
weighted_accuracy : 0.22361888764464352
matthews_correlation : 0.2016284359533019
confusion_matrix : aml://artifactId/ExperimentRun/dcid.AutoML_ec312415-7fd6-470d-80d5-0dd0b823b9ed_30/confusion_matrix
accuracy_table : aml://artifactId/E
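
The metrics dictionary mixes numeric scores with references to stored artifacts such as `confusion_matrix`. A minimal sketch of separating the two follows, using a few values copied from the output above; the `aml://` prefix test is an assumption based on how the artifact entries appear in that output.

```python
# A subset of the metrics printed above.
metrics = {
    "AUC_weighted": 0.7721704520014634,
    "accuracy": 0.25633528265107214,
    "log_loss": 2.112302938320622,
    "confusion_matrix": "aml://artifactId/ExperimentRun/dcid.AutoML_ec312415-7fd6-470d-80d5-0dd0b823b9ed_30/confusion_matrix",
}

# Numeric scores can be compared and formatted directly.
scalar_metrics = {k: v for k, v in metrics.items() if isinstance(v, (int, float))}

# Artifact entries are string references rather than values.
artifact_metrics = {
    k: v for k, v in metrics.items() if isinstance(v, str) and v.startswith("aml://")
}

for name in sorted(scalar_metrics):
    print(f"{name:<15} {scalar_metrics[name]:.4f}")
print("artifacts:", list(artifact_metrics))
```

Splitting the dictionary this way also makes it easier to decide which entries are worth storing as model tags later on.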

### Test the fitted model
Now that the model is trained, you can use it to make predictions. For simplicity, this example predicts on the training data.

In [15]:
test_features = features.copy()
test_features.remove('grade')
fitted_model.predict(train.loc[train['query']=='powershell'][test_features])

array([ 6,  6, 10,  2,  2,  2, 10, 10, 10,  1])
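
To get a feel for how close these predictions are, they can be compared with the `grade` column. The sketch below assumes that the ten rows displayed earlier for the query `powershell` correspond, in order, to the ten predictions; both lists are copied from the outputs in this tutorial.

```python
# Grades from the displayed training rows and predictions from the cell above.
true_grades = [7, 6, 9, 5, 4, 2, 10, 8, 3, 1]
predicted = [6, 6, 10, 2, 2, 2, 10, 10, 10, 1]

pairs = list(zip(true_grades, predicted))
exact_matches = sum(t == p for t, p in pairs)
accuracy = exact_matches / len(pairs)
mean_abs_error = sum(abs(t - p) for t, p in pairs) / len(pairs)

print(f"exact matches: {exact_matches}/{len(pairs)} (accuracy {accuracy:.2f})")
print(f"mean absolute error: {mean_abs_error:.2f}")
```

Since these rows were also used for training, the numbers say little about how the model generalizes; a held-out set would give a fairer estimate.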

### Register and deploy models
You can register a model, so you can come back to it for later use.
To register a model from an automated ML run, use the register_model() method.

In [16]:
model_name = best_run.properties['model_name']
print(model_name)
description = 'AutoML prediction emailservice example'
metrics = ['f1_score_macro', 'norm_macro_recall', 'AUC_weighted', 'balanced_accuracy', 
           'precision_score_macro', 'log_loss', 'AUC_micro', 'AUC_macro', 'precision_score_micro', 
           'recall_score_micro', 'matthews_correlation', 'recall_score_weighted', 'weighted_accuracy',
           'f1_score_weighted', 'f1_score_micro', 'average_precision_score_macro', 
           'average_precision_score_micro', 'average_precision_score_weighted', 'accuracy',
           'recall_score_macro', 'precision_score_weighted', 'confusion_matrix', 'accuracy_table']

tags = {}
for key in metrics:
    tags[key] = run.get_metrics(key).get(key)

model = run.register_model(model_name = 'predict-link-automlmodel',
#                            model_path='outputs/model.pkl',
                           description = description, 
                           tags = tags)

print(model.name, model.id, model.version, model.tags, sep='\n')

AutoMLec312415730
predict-link-automlmodel
predict-link-automlmodel:1
1
{'f1_score_macro': '0.2233333333333333', 'norm_macro_recall': '0.20833333333333334', 'AUC_weighted': '0.7721704520014634', 'balanced_accuracy': '0.2833333333333333', 'precision_score_macro': '0.23833333333333337', 'log_loss': '2.112302938320622', 'AUC_micro': '0.7116523349381323', 'AUC_macro': '0.7780358223536328', 'precision_score_micro': '0.25633528265107214', 'recall_score_micro': '0.25633528265107214', 'matthews_correlation': '0.2016284359533019', 'recall_score_weighted': '0.25633528265107214', 'weighted_accuracy': '0.22361888764464352', 'f1_score_weighted': '0.23784925276153346', 'f1_score_micro': '0.25633528265107214', 'average_precision_score_macro': '0.47054281109836665', 'average_precision_score_micro': '0.31681756790638854', 'average_precision_score_weighted': '0.4922245098414688', 'accuracy': '0.25633528265107214', 'recall_score_macro': '0.2833333333333333', 'precision_score_weighted': '0.278411306042885

### Retrieve the model

You registered a model in your workspace. Now, load this workspace and download the model to your local directory.

In [17]:
from azureml.core.model import Model
import os 
download_model = Model(workspace,'predict-link-automlmodel') # Default will get the latest version.

download_model.download(target_dir=os.getcwd(), exist_ok=True)
print(download_model)

file_path = os.path.join(os.getcwd(), "model.pkl")

Model(workspace=Workspace.create(name='csidmlws', subscription_id='ebe8d9fa-67d0-4af1-bce2-4a5b07e50a42', resource_group='cmt-202011001'), name=predict-link-automlmodel, id=predict-link-automlmodel:1, version=1, tags={'f1_score_macro': '0.2233333333333333', 'norm_macro_recall': '0.20833333333333334', 'AUC_weighted': '0.7721704520014634', 'balanced_accuracy': '0.2833333333333333', 'precision_score_macro': '0.23833333333333337', 'log_loss': '2.112302938320622', 'AUC_micro': '0.7116523349381323', 'AUC_macro': '0.7780358223536328', 'precision_score_micro': '0.25633528265107214', 'recall_score_micro': '0.25633528265107214', 'matthews_correlation': '0.2016284359533019', 'recall_score_weighted': '0.25633528265107214', 'weighted_accuracy': '0.22361888764464352', 'f1_score_weighted': '0.23784925276153346', 'f1_score_micro': '0.25633528265107214', 'average_precision_score_macro': '0.47054281109836665', 'average_precision_score_micro': '0.31681756790638854', 'average_precision_score_weighted': '0

### Predict test data

Feed the test dataset to the model to get predictions.

In [19]:
import joblib # Use this to load the model that was created earlier on
automl_model = joblib.load(file_path)
print(automl_model)

test_features = features.copy()
test_features.remove('grade')
automl_model.predict(train.loc[train['query']=='powershell'][test_features])

Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                        degree=3,
                                                                                        gamma='scale',
                                                                                        kernel='rbf',
                                                                                        m

array([ 6,  6, 10,  2,  2,  2, 10, 10, 10,  1])