# Part 3. AutoML

1. Train an AutoML model locally
2. Customizability of Azure AutoML

## Setup

```
pip install azureml-sdk[automl]

# for mac:
brew install lightgbm
brew install libomp
```

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from tqdm import tqdm
import logging

## Workspace

In [2]:
import os

subscription_id = os.getenv("SUBSCRIPTION_ID")
resource_group = os.getenv("RESOURCE_GROUP")
workspace_name = os.getenv("WORKSPACE_NAME")
workspace_region = os.getenv("WORKSPACE_REGION")

from azureml.core import Workspace
ws = Workspace(subscription_id = subscription_id, resource_group = resource_group, workspace_name = workspace_name)
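
`os.getenv` returns `None` for any variable that is not set, so a missing setting only surfaces later, when the workspace is constructed. A minimal sketch of an early check follows; the helper name `read_workspace_settings` and the `REQUIRED_VARS` tuple are hypothetical and not part of the Azure ML SDK.

```python
import os

# The three variables the Workspace constructor above relies on.
REQUIRED_VARS = ("SUBSCRIPTION_ID", "RESOURCE_GROUP", "WORKSPACE_NAME")


def read_workspace_settings(env=os.environ):
    """Return the workspace settings, raising if any variable is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Lower-cased keys match the keyword arguments used in the cell above.
    return {name.lower(): env[name] for name in REQUIRED_VARS}
```

With this in place, the workspace could be created as `Workspace(**read_workspace_settings())`.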

## Train an AutoML model locally

In [3]:
df = pd.read_csv('data/train.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
from azureml.train.automl import AutoMLConfig
# https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train#configure-your-experiment-settings
# https://docs.microsoft.com/en-us/python/api/azureml-train-automl/azureml.train.automl.automlconfig?view=azure-ml-py
automl_config = AutoMLConfig(
    task='classification',
    training_data=df,
    label_column_name='Survived',
    primary_metric='AUC_weighted',
    n_cross_validations=2,
    
    # Try for at most 1 minute, then give up.
    experiment_timeout_minutes=1,
    
    # Featurization
    preprocess=True,
    
    debug_log="automated_ml_errors.log",
    verbosity=logging.INFO)

In [5]:
from azureml.core import Experiment
experiment = Experiment(ws, "titanic-automl-local-22")
local_run = experiment.submit(automl_config, show_output=True)

Running on local machine
Parent Run ID: AutoML_f270975e-99da-4b04-8835-4d7e4860afe0
Current status: DatasetFeaturization. Beginning to featurize the dataset.
Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturizationCompleted. Completed featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.

****************************************************************************************************
DATA GUARDRAILS SUMMARY:
For more details, use API: run.get_guardrails()

TYPE:         Class Balancing Detection
STATUS:       PASSED
DESCRIPTION:  Classes are balanced in the training data.

TYPE:         Missing Values Imputation
STATUS:       FIXED
DESCRIPTION:  The training data had the following missing values which were resolved.

Please review your data source for data quality issues and possibly filter out the rows w
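
The guardrails report that missing values were imputed, and the featurization summary later in this notebook lists `MeanImputer` and `ImputationMarker` for `Age`. As a rough illustration of that idea (not AutoML's actual implementation), here is a plain-Python sketch; the function name and the sample ages are made up.

```python
def mean_impute_with_marker(values):
    """Replace None with the mean of the observed values and flag imputed rows.

    Assumes at least one value is observed.
    """
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    imputed = [mean if v is None else v for v in values]
    marker = [1 if v is None else 0 for v in values]
    return imputed, marker


# Made-up sample with two missing ages.
ages = [22.0, 38.0, None, 35.0, None]
imputed, marker = mean_impute_with_marker(ages)
print(imputed)
print(marker)
```

The marker column keeps the information that a value was missing, which the imputed column alone would lose.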

In [6]:
local_run

Experiment,Id,Type,Status,Details Page,Docs Page
titanic-automl-local-22,AutoML_f270975e-99da-4b04-8835-4d7e4860afe0,automl,Completed,Link to Azure Machine Learning studio,Link to Documentation


In [7]:
from azureml.widgets import RunDetails
RunDetails(local_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

In [8]:
best_run, fitted_model = local_run.get_output()
# print(best_run)
print(fitted_model)

Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
        feature_sweeping_config=None, feature_sweeping_timeout=None,
        featurization_config=None, is_cross_validation=None,
        is_onnx_compatible=None, logger=None, observer=None, task=None)), ('pref...      weights=[0.35714285714285715, 0.21428571428571427, 0.2857142857142857, 0.14285714285714285]))])


In [9]:
fitted_model.named_steps.keys()

dict_keys(['datatransformer', 'prefittedsoftvotingclassifier'])
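
The final step is a soft-voting ensemble whose four weights appear in the printed pipeline. Assuming it follows the usual soft-voting rule (a weighted average of each model's class probabilities, which is not confirmed here for AutoML's own class), the combination can be sketched as below. The per-model probabilities are hypothetical.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-model class probabilities."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]


# Weights copied from the fitted pipeline printed above.
weights = [0.35714285714285715, 0.21428571428571427,
           0.2857142857142857, 0.14285714285714285]
# Hypothetical [P(not survived), P(survived)] from the four models for one passenger.
probabilities = [[0.2, 0.8], [0.6, 0.4], [0.3, 0.7], [0.5, 0.5]]

combined = soft_vote(probabilities, weights)
predicted_class = combined.index(max(combined))
print(combined, predicted_class)
```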

## Dig into the Featurizations done

In [10]:
summary_ = fitted_model.named_steps['datatransformer'].get_featurization_summary()

from pprint import pprint
pprint(summary_)

[{'Dropped': 'No',
  'EngineeredFeatureCount': 2,
  'RawFeatureName': 'Age',
  'Transformations': ['MeanImputer', 'ImputationMarker'],
  'TypeDetected': 'Numeric'},
 {'Dropped': 'No',
  'EngineeredFeatureCount': 1,
  'RawFeatureName': 'Fare',
  'Transformations': ['MeanImputer'],
  'TypeDetected': 'Numeric'},
 {'Dropped': 'No',
  'EngineeredFeatureCount': 1,
  'RawFeatureName': 'PassengerId',
  'Transformations': ['MeanImputer'],
  'TypeDetected': 'Numeric'},
 {'Dropped': 'Yes',
  'EngineeredFeatureCount': 0,
  'RawFeatureName': 'Cabin',
  'Transformations': [''],
  'TypeDetected': 'Ignore'},
 {'Dropped': 'Yes',
  'EngineeredFeatureCount': 0,
  'RawFeatureName': 'Ticket',
  'Transformations': [''],
  'TypeDetected': 'Ignore'},
 {'Dropped': 'No',
  'EngineeredFeatureCount': 4,
  'RawFeatureName': 'Embarked',
  'Transformations': ['StringCast-CharGramCountVectorizer'],
  'TypeDetected': 'Categorical'},
 {'Dropped': 'No',
  'EngineeredFeatureCount': 7,
  'RawFeatureName': 'Parch',
  'Tran
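
The summary is a list of plain dictionaries, so it can be filtered with ordinary Python. A small sketch, using a hand-copied subset of the entries printed above rather than a live `summary_` object:

```python
# First six entries of the printed summary, copied by hand (Transformations omitted).
summary_sample = [
    {'RawFeatureName': 'Age', 'TypeDetected': 'Numeric', 'Dropped': 'No', 'EngineeredFeatureCount': 2},
    {'RawFeatureName': 'Fare', 'TypeDetected': 'Numeric', 'Dropped': 'No', 'EngineeredFeatureCount': 1},
    {'RawFeatureName': 'PassengerId', 'TypeDetected': 'Numeric', 'Dropped': 'No', 'EngineeredFeatureCount': 1},
    {'RawFeatureName': 'Cabin', 'TypeDetected': 'Ignore', 'Dropped': 'Yes', 'EngineeredFeatureCount': 0},
    {'RawFeatureName': 'Ticket', 'TypeDetected': 'Ignore', 'Dropped': 'Yes', 'EngineeredFeatureCount': 0},
    {'RawFeatureName': 'Embarked', 'TypeDetected': 'Categorical', 'Dropped': 'No', 'EngineeredFeatureCount': 4},
]

# Columns AutoML decided to drop.
dropped = [row['RawFeatureName'] for row in summary_sample if row['Dropped'] == 'Yes']

# Raw columns grouped by detected type.
by_type = {}
for row in summary_sample:
    by_type.setdefault(row['TypeDetected'], []).append(row['RawFeatureName'])

total_engineered = sum(row['EngineeredFeatureCount'] for row in summary_sample)
print(dropped, by_type, total_engineered)
```

Grouping by detected type makes questionable choices easy to spot, such as `PassengerId` being treated as a numeric feature.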

In [14]:
fitted_model.named_steps['datatransformer'].get_engineered_feature_names()

['Age_MeanImputer',
 'Age_ImputationMarker',
 'Fare_MeanImputer',
 'PassengerId_MeanImputer',
 'Embarked_CharGramCountVectorizer_c',
 'Embarked_CharGramCountVectorizer_nan',
 'Embarked_CharGramCountVectorizer_q',
 'Embarked_CharGramCountVectorizer_s',
 'Parch_CharGramCountVectorizer_0',
 'Parch_CharGramCountVectorizer_1',
 'Parch_CharGramCountVectorizer_2',
 'Parch_CharGramCountVectorizer_3',
 'Parch_CharGramCountVectorizer_4',
 'Parch_CharGramCountVectorizer_5',
 'Parch_CharGramCountVectorizer_6',
 'Pclass_CharGramCountVectorizer_1',
 'Pclass_CharGramCountVectorizer_2',
 'Pclass_CharGramCountVectorizer_3',
 'Sex_ModeCatImputer_LabelEncoder',
 'SibSp_CharGramCountVectorizer_0',
 'SibSp_CharGramCountVectorizer_1',
 'SibSp_CharGramCountVectorizer_2',
 'SibSp_CharGramCountVectorizer_3',
 'SibSp_CharGramCountVectorizer_4',
 'SibSp_CharGramCountVectorizer_5',
 'SibSp_CharGramCountVectorizer_8',
 'Name_CharGramTfIdf_ "a',
 'Name_CharGramTfIdf_ "b',
 'Name_CharGramTfIdf_ "c',
 'Name_CharGramT
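
To see which raw column contributes the most engineered features, the names can be grouped by their prefix. This sketch uses a hand-copied subset of the list above and assumes the raw column name is everything before the first underscore, which holds for the Titanic columns but not for column names that contain underscores themselves.

```python
from collections import Counter

# Subset of the engineered feature names printed above, copied by hand.
names_sample = [
    'Age_MeanImputer',
    'Age_ImputationMarker',
    'Fare_MeanImputer',
    'Embarked_CharGramCountVectorizer_c',
    'Embarked_CharGramCountVectorizer_nan',
    'Embarked_CharGramCountVectorizer_q',
    'Embarked_CharGramCountVectorizer_s',
    'Sex_ModeCatImputer_LabelEncoder',
]

# Assumption: the raw column name is the part before the first underscore.
counts = Counter(name.split('_', 1)[0] for name in names_sample)
for column, count in counts.most_common():
    print(column, count)
```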

In [13]:
# Copy-pasted this part from the doc.
from pprint import pprint

def print_model(model, prefix=""):
    for step in model.steps:
        print(prefix + step[0])
        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):
            pprint({'estimators': list(
                e[0] for e in step[1].estimators), 'weights': step[1].weights})
            print()
            for estimator in step[1].estimators:
                print_model(estimator[1], estimator[0] + ' - ')
        else:
            pprint(step[1].get_params())
            print()


print_model(fitted_model)

datatransformer
{'enable_dnn': None,
 'enable_feature_sweeping': None,
 'feature_sweeping_config': None,
 'feature_sweeping_timeout': None,
 'featurization_config': None,
 'is_cross_validation': None,
 'is_onnx_compatible': None,
 'logger': None,
 'observer': None,
 'task': None}

prefittedsoftvotingclassifier
{'estimators': ['0', '2', '1', '3'],
 'weights': [0.35714285714285715,
             0.21428571428571427,
             0.2857142857142857,
             0.14285714285714285]}

0 - maxabsscaler
{'copy': True}

0 - lightgbmclassifier
{'boosting_type': 'goss',
 'class_weight': None,
 'colsample_bytree': 0.7922222222222222,
 'importance_type': 'split',
 'learning_rate': 0.1,
 'max_bin': 170,
 'max_depth': 4,
 'min_child_samples': 34,
 'min_child_weight': 4,
 'min_split_gain': 0.8421052631578947,
 'n_estimators': 50,
 'n_jobs': 1,
 'num_leaves': 62,
 'objective': None,
 'random_state': None,
 'reg_alpha': 0.7894736842105263,
 'reg_lambda': 0.15789473684210525,
 'silent': True,
 'subsamp

In [15]:
# from azureml.train.automl import AutoMLConfig
# # https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train#configure-your-experiment-settings
# # https://docs.microsoft.com/en-us/python/api/azureml-train-automl/azureml.train.automl.automlconfig?view=azure-ml-py
# automl_config = AutoMLConfig(
#     task='classification',
#     training_data=df,
#     label_column_name='Survived',
#     primary_metric='AUC_weighted',
#     n_cross_validations=2,
    
#     # Try for at most 1 minute, then give up.
#     experiment_timeout_minutes=1,
    
#     # Featurization
#     preprocess=True,
#     featurization=featurization_config,
    
#     debug_log="automated_ml_errors.log",
#     verbosity=logging.INFO)

## NOTE: AutoML has no direct support for taking the pipeline apart, e.g. to use just the featurization step.

## Configuring Featurization

If the automatic featurization made a poor choice, you can pass a `FeaturizationConfig` to `AutoMLConfig` and re-run the experiment.

We can use this to, for example, block the tf-idf features generated from text columns.

```
from azureml.automl.core.featurization.featurizationconfig import FeaturizationConfig

# Note: these column names are not Titanic columns; adapt them to your data.
featurization_config = FeaturizationConfig()
featurization_config.blocked_transformers = ['LabelEncoder']
featurization_config.drop_columns = ['aspiration', 'stroke']
featurization_config.add_column_purpose('engine-size', 'Numeric')
featurization_config.add_column_purpose('body-style', 'CategoricalHash')
featurization_config.add_transformer_params('Imputer', ['engine-size'], {"strategy": "median"})
featurization_config.add_transformer_params('Imputer', ['city-mpg'], {"strategy": "median"})
featurization_config.add_transformer_params('Imputer', ['bore'], {"strategy": "most_frequent"})
featurization_config.add_transformer_params('HashOneHotEncoder', [], {"number_of_bits": 3})

...

# Pass the configured instance (not the class) as a keyword argument.
automl_config = AutoMLConfig(
    ...
    featurization=featurization_config,
    )
```

See:
- https://docs.microsoft.com/en-us/python/api/azureml-train-automl/azureml.train.automl.automlconfig?view=azure-ml-py
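
Applied to the Titanic data, a configuration might look like the sketch below. It is untested here and rests on two assumptions: that `'TfIdf'` is an accepted blocked-transformer name in the installed SDK version, and that dropping `PassengerId` is desirable because it is a row identifier rather than a predictive feature.

```python
from azureml.automl.core.featurization.featurizationconfig import FeaturizationConfig

featurization_config = FeaturizationConfig()

# Assumption: 'TfIdf' is an accepted transformer name in your SDK version.
# This is meant to stop the character n-gram tf-idf features generated from Name.
featurization_config.blocked_transformers = ['TfIdf']

# PassengerId was featurized as a numeric column in the summary above.
featurization_config.drop_columns = ['PassengerId']

# Then pass it to the experiment configuration:
# automl_config = AutoMLConfig(..., featurization=featurization_config)
```

After re-running, `get_featurization_summary()` should show whether the changes took effect.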

In [None]:
from azureml.automl.core.featurization.featurizationconfig import FeaturizationConfig
#FeaturizationConfig?