# Amazon SageMaker Autopilot Candidate Definition Notebook

This notebook was automatically generated by the AutoML job **Funders-USA-SEC-data-MVP-1**.
This notebook allows you to customize the candidate definitions and execute the SageMaker Autopilot workflow.

The dataset has **17** columns and the column named **Funding_Success** is used as
the target column. This is being treated as a **BinaryClassification** problem. The dataset also has **2** classes.
This notebook will build a **[BinaryClassification](https://en.wikipedia.org/wiki/Binary_classification)** model that
**maximizes** the "**ACCURACY**" quality metric of the trained models.
The "**ACCURACY**" metric provides the percentage of times the model predicted the correct class.
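
As a small illustration of the metric described above, accuracy can be computed by hand as the share of matching predictions. The labels below are made up for the example and are not taken from the dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 1 = funding succeeded, 0 = it did not.
labels = [1, 0, 1, 1]
predictions = [1, 0, 0, 1]
print(accuracy(labels, predictions))  # 3 of 4 correct -> 0.75
```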

As part of the AutoML job, the input dataset has been randomly split into two pieces, one for **training** and one for
**validation**. This notebook helps you inspect and modify the data transformation approaches proposed by Amazon SageMaker Autopilot. You can interactively
train the data transformation models and use them to transform the data. Finally, you can execute a multiple algorithm hyperparameter optimization (multi-algo HPO)
job that helps you find the best model for your dataset by jointly optimizing the data transformations and machine learning algorithms.
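
The random training/validation split mentioned above can be pictured with the sketch below. The split fraction and seed are assumptions chosen for illustration; the ratio actually used by the AutoML job is not stated in this notebook.

```python
import random


def random_split(rows, validation_fraction=0.2, seed=0):
    """Shuffle rows and split them into (training, validation) lists."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# validation_fraction=0.2 is a hypothetical value, not the job's setting.
train_rows, validation_rows = random_split(range(100))
print(len(train_rows), len(validation_rows))  # 80 20
```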

<div class="alert alert-info"> 💡 <strong> Available Knobs</strong>
Look for sections like this for recommended settings that you can change.
</div>


---

## Contents

1. [Sagemaker Setup](#Sagemaker-Setup)
    1. [Downloading Generated Modules](#Downloading-Generated-Modules)
    1. [SageMaker Autopilot Job and Amazon Simple Storage Service (Amazon S3) Configuration](#SageMaker-Autopilot-Job-and-Amazon-Simple-Storage-Service-(Amazon-S3)-Configuration)
1. [Candidate Pipelines](#Candidate-Pipelines)
    1. [Generated Candidates](#Generated-Candidates)
    1. [Selected Candidates](#Selected-Candidates)
1. [Executing the Candidate Pipelines](#Executing-the-Candidate-Pipelines)
    1. [Run Data Transformation Steps](#Run-Data-Transformation-Steps)
    1. [Multi Algorithm Hyperparameter Tuning](#Multi-Algorithm-Hyperparameter-Tuning)
1. [Model Selection and Deployment](#Model-Selection-and-Deployment)
    1. [Tuning Job Result Overview](#Tuning-Job-Result-Overview)
    1. [Model Deployment](#Model-Deployment)

---

## Sagemaker Setup

Before you launch the SageMaker Autopilot jobs, set up the environment for Amazon SageMaker:
- Check the environment and dependencies.
- Create a few helper objects and functions to organize input/output data and SageMaker sessions.

**Minimal Environment Requirements**

- Jupyter: Tested on `JupyterLab 1.0.6`, `jupyter_core 4.5.0` and `IPython 6.4.0`
- Kernel: `conda_python3`
- Dependencies required
  - `sagemaker-python-sdk>=2.19.0`
    - Use `!pip install sagemaker==2.19.0` to install this dependency.
    - You may need to restart the kernel after installation.
- Expected Execution Role/permission
  - S3 access to the bucket that stores the notebook.
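
One way to check the dependency requirement from inside the kernel is sketched below. It assumes Python 3.8 or later (for `importlib.metadata`) and that the SDK is installed under the distribution name `sagemaker`, as in the pip command above. The helpers `version_tuple` and `meets_minimum` are illustrative names, not part of the SDK.

```python
import itertools
from importlib import metadata


def version_tuple(version):
    """Convert '2.19.0' into (2, 19, 0), keeping only leading digits per part."""
    parts = []
    for piece in version.split("."):
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum="2.19.0"):
    """True when the installed version is at least the required minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = metadata.version("sagemaker")
    status = "OK" if meets_minimum(installed) else "too old, upgrade needed"
    print(f"sagemaker {installed}: {status}")
except metadata.PackageNotFoundError:
    print("sagemaker is not installed; run the pip command above")
```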

### Downloading Generated Modules
Download the generated data transformation modules and a SageMaker Autopilot helper module used by this notebook.
These artifacts will be downloaded to the **Funders-USA-SEC-data-MVP-1-artifacts** folder.

In [1]:
!mkdir -p Funders-USA-SEC-data-MVP-1-artifacts
!aws s3 sync s3://sagemaker-us-east-1-570124035543/export-flow-04-21-57-21-dd5ae906/machine-learning-output/autopilot-output/Funders-USA-SEC-data-MVP-1/sagemaker-automl-candidates/pr-1-93f644fc87dd4359b02b3f5674282e148942c514864c4937956e91e3be/generated_module Funders-USA-SEC-data-MVP-1-artifacts/generated_module --only-show-errors
!aws s3 sync s3://sagemaker-us-east-1-570124035543/export-flow-04-21-57-21-dd5ae906/machine-learning-output/autopilot-output/Funders-USA-SEC-data-MVP-1/sagemaker-automl-candidates/pr-1-93f644fc87dd4359b02b3f5674282e148942c514864c4937956e91e3be/notebooks/sagemaker_automl Funders-USA-SEC-data-MVP-1-artifacts/sagemaker_automl --only-show-errors

import sys
sys.path.append("Funders-USA-SEC-data-MVP-1-artifacts")

### SageMaker Autopilot Job and Amazon Simple Storage Service (Amazon S3) Configuration

The following configuration has been derived from the SageMaker Autopilot job. These items configure where this notebook will
look for generated candidates, and where input and output data is stored on Amazon S3.

In [2]:
from sagemaker_automl import uid, AutoMLLocalRunConfig

# Where the preprocessed data from the existing AutoML job is stored
BASE_AUTOML_JOB_NAME = 'Funders-USA-SEC-data-MVP-1'
BASE_AUTOML_JOB_CONFIG = {
    'automl_job_name': BASE_AUTOML_JOB_NAME,
    'automl_output_s3_base_path': 's3://sagemaker-us-east-1-570124035543/export-flow-04-21-57-21-dd5ae906/machine-learning-output/autopilot-output/Funders-USA-SEC-data-MVP-1',
    'data_transformer_image_repo_version': '0.2-1-cpu-py3',
    'algo_image_repo_versions': {'xgboost': '1.0-1-cpu-py3', 'linear-learner': 'latest', 'mlp': 'training-cpu'},
    'algo_inference_image_repo_versions': {'xgboost': '1.0-1-cpu-py3', 'linear-learner': 'latest', 'mlp': 'inference-cpu'}
}

# Path conventions of the output data storage path from the local AutoML job run of this notebook
LOCAL_AUTOML_JOB_NAME = 'Funders-US-notebook-run-{}'.format(uid())
LOCAL_AUTOML_JOB_CONFIG = {
    'local_automl_job_name': LOCAL_AUTOML_JOB_NAME,
    'local_automl_job_output_s3_base_path': 's3://sagemaker-us-east-1-570124035543/export-flow-04-21-57-21-dd5ae906/machine-learning-output/autopilot-output/Funders-USA-SEC-data-MVP-1/{}'.format(LOCAL_AUTOML_JOB_NAME),
    'data_processing_model_dir': 'data-processor-models',
    'data_processing_transformed_output_dir': 'transformed-data',
    'multi_algo_tuning_output_dir': 'multi-algo-tuning'
}

AUTOML_LOCAL_RUN_CONFIG = AutoMLLocalRunConfig(
    role='arn:aws:iam::570124035543:role/service-role/AmazonSageMaker-ExecutionRole-20200602T133715',
    base_automl_job_config=BASE_AUTOML_JOB_CONFIG,
    local_automl_job_config=LOCAL_AUTOML_JOB_CONFIG,
    security_config={'EnableInterContainerTrafficEncryption': False, 'VpcConfig': {}})

AUTOML_LOCAL_RUN_CONFIG.display()

This notebook is initialized to use the following configuration: 
        <table>
        <tr><th colspan=2>Name</th><th>Value</th></tr>
        <tr><th>General</th><th>Role</th><td>arn:aws:iam::570124035543:role/service-role/AmazonSageMaker-ExecutionRole-20200602T133715</td></tr>
        <tr><th rowspan=2>Base AutoML Job</th><th>Job Name</th><td>Funders-USA-SEC-data-MVP-1</td></tr>
        <tr><th>Base Output S3 Path</th><td>s3://sagemaker-us-east-1-570124035543/export-flow-04-21-57-21-dd5ae906/machine-learning-output/autopilot-output/Funders-USA-SEC-data-MVP-1</td></tr>
        <tr><th rowspan=5>Interactive Job</th><th>Job Name</th><td>Funders-US-notebook-run-11-04-05-35</td></tr>
        <tr><th>Base Output S3 Path</th><td>s3://sagemaker-us-east-1-570124035543/export-flow-04-21-57-21-dd5ae906/machine-learning-output/autopilot-output/Funders-USA-SEC-data-MVP-1/Funders-US-notebook-run-11-04-05-35</td></tr>
        <tr><th>Data Processing Trained Model Directory</th><td>s3://sagemaker-us-east-1-570124035543/export-flow-04-21-57-21-dd5ae906/machine-learning-output/autopilot-output/Funders-USA-SEC-data-MVP-1/Funders-US-notebook-run-11-04-05-35/data-processor-models</td></tr>
        <tr><th>Data Processing Transformed Output</th><td>s3://sagemaker-us-east-1-570124035543/export-flow-04-21-57-21-dd5ae906/machine-learning-output/autopilot-output/Funders-USA-SEC-data-MVP-1/Funders-US-notebook-run-11-04-05-35/transformed-data</td></tr>
        <tr><th>Algo Tuning Model Output Directory</th><td>s3://sagemaker-us-east-1-570124035543/export-flow-04-21-57-21-dd5ae906/machine-learning-output/autopilot-output/Funders-USA-SEC-data-MVP-1/Funders-US-notebook-run-11-04-05-35/multi-algo-tuning</td></tr>
        </table>
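
The interactive job paths in the table follow a simple pattern: the base output path, then the local job name, then one sub-directory per output type. The sketch below reproduces that pattern with a hypothetical helper, `local_job_paths`, and a placeholder bucket; it is not the logic inside `AutoMLLocalRunConfig`, which is not shown here.

```python
def local_job_paths(base_output_path, local_job_name, subdirs):
    """Return the job root and one S3 prefix per named sub-directory."""
    root = "{}/{}".format(base_output_path.rstrip("/"), local_job_name)
    paths = {"root": root}
    for key, subdir in subdirs.items():
        paths[key] = "{}/{}".format(root, subdir)
    return paths


paths = local_job_paths(
    # Placeholder bucket, not the one used by this notebook.
    "s3://example-bucket/autopilot-output/Funders-USA-SEC-data-MVP-1",
    "Funders-US-notebook-run-11-04-05-35",
    {
        "data_processing_model_dir": "data-processor-models",
        "data_processing_transformed_output_dir": "transformed-data",
        "multi_algo_tuning_output_dir": "multi-algo-tuning",
    },
)
for name, path in paths.items():
    print(name, "->", path)
```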
        

## Candidate Pipelines

The `AutoMLInteractiveRunner` keeps track of selected candidates and automates many of the steps needed to execute feature engineering and tuning steps.

In [3]:
from sagemaker_automl import AutoMLInteractiveRunner, AutoMLLocalCandidate

automl_interactive_runner = AutoMLInteractiveRunner(AUTOML_LOCAL_RUN_CONFIG)

### Generated Candidates

The SageMaker Autopilot Job has analyzed the dataset and has generated **6** machine learning
pipeline(s) that use **3** algorithm(s). Each pipeline contains a set of feature transformers and an
algorithm.

<div class="alert alert-info"> 💡 <strong> Available Knobs</strong>

1. The resource configuration: instance type & count
1. The candidate pipelines to select: each definition lives in its own cell, so run only the cells for the candidates you want
1. The linked data transformation script can be reviewed and updated. Please refer to the [README.md](./Funders-USA-SEC-data-MVP-1-artifacts/generated_module/README.md) for detailed customization instructions.
</div>
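
Because the candidate definitions share the same layout, changing the instance type in every cell by hand is repetitive. The sketch below builds a dictionary of that shape from a few arguments. `make_candidate` is a hypothetical helper, not part of `sagemaker_automl`, and the field names are copied from the definitions in this notebook.

```python
def make_candidate(transformer_name, algorithm_name,
                   instance_type="ml.m5.4xlarge", instance_count=1,
                   data_format="text/csv", sparse_encoding=False):
    """Build a candidate definition with one resource setting used throughout."""
    resources = {"instance_type": instance_type, "instance_count": instance_count}
    return {
        "data_transformer": {
            "name": transformer_name,
            "training_resource_config": dict(resources, volume_size_in_gb=50),
            "transform_resource_config": dict(resources),
            "transforms_label": True,
            "transformed_data_format": data_format,
            "sparse_encoding": sparse_encoding,
        },
        "algorithm": {
            "name": algorithm_name,
            "training_resource_config": dict(resources),
        },
    }


# Example: the dpp0 candidate on a smaller instance type.
candidate = make_candidate("dpp0", "xgboost", instance_type="ml.m5.2xlarge")
print(candidate["algorithm"]["training_resource_config"])
```

A dictionary built this way has the same structure as the ones passed to `select_candidate` in the cells below, so it could be passed in their place.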

**[dpp0-xgboost](Funders-USA-SEC-data-MVP-1-artifacts/generated_module/candidate_data_processors/dpp0.py)**: This data transformation strategy first transforms 'numeric' features using [RobustImputer (converts missing values to nan)](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/impute/base.py). It merges all the generated features and applies [RobustStandardScaler](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/data.py). The
transformed data will be used to tune an *xgboost* model. Here is the definition:

In [4]:
automl_interactive_runner.select_candidate({
    "data_transformer": {
        "name": "dpp0",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
            "volume_size_in_gb":  50
        },
        "transform_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
        "transforms_label": True,
        "transformed_data_format": "text/csv",
        "sparse_encoding": False
    },
    "algorithm": {
        "name": "xgboost",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
    }
})

2021-02-11 04:05:38,382 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2021-02-11 04:05:38,384 INFO sagemaker.image_uris: Defaulting to only available Python version: py3
2021-02-11 04:05:38,400 INFO sagemaker.image_uris: Defaulting to only supported image scope: cpu.
2021-02-11 04:05:38,402 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2021-02-11 04:05:38,403 INFO sagemaker.image_uris: Defaulting to only available Python version: py3
2021-02-11 04:05:38,471 INFO sagemaker.image_uris: Defaulting to only supported image scope: cpu.
2021-02-11 04:05:38,473 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2021-02-11 04:05:38,474 INFO sagemaker.image_uris: Defaulting to only available Python version: py3
2021-02-11 04:05:38,483 INFO sagemaker.image_uris: Defaulting to only supported image scope: cpu.


**[dpp1-xgboost](Funders-USA-SEC-data-MVP-1-artifacts/generated_module/candidate_data_processors/dpp1.py)**: This data transformation strategy first transforms 'numeric' features using [RobustImputer](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/impute/base.py). It merges all the generated features and applies [RobustPCA](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/decomposition/robust_pca.py) followed by [RobustStandardScaler](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/data.py). The
transformed data will be used to tune an *xgboost* model. Here is the definition:

In [5]:
automl_interactive_runner.select_candidate({
    "data_transformer": {
        "name": "dpp1",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
            "volume_size_in_gb":  50
        },
        "transform_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
        "transforms_label": True,
        "transformed_data_format": "text/csv",
        "sparse_encoding": False
    },
    "algorithm": {
        "name": "xgboost",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
    }
})

2021-02-11 04:05:42,689 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2021-02-11 04:05:42,691 INFO sagemaker.image_uris: Defaulting to only available Python version: py3
2021-02-11 04:05:42,710 INFO sagemaker.image_uris: Defaulting to only supported image scope: cpu.
2021-02-11 04:05:42,714 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2021-02-11 04:05:42,715 INFO sagemaker.image_uris: Defaulting to only available Python version: py3
2021-02-11 04:05:42,734 INFO sagemaker.image_uris: Defaulting to only supported image scope: cpu.
2021-02-11 04:05:42,737 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2021-02-11 04:05:42,738 INFO sagemaker.image_uris: Defaulting to only available Python version: py3
2021-02-11 04:05:42,756 INFO sagemaker.image_uris: Defaulting to only supported image scope: cpu.


**[dpp2-linear-learner](Funders-USA-SEC-data-MVP-1-artifacts/generated_module/candidate_data_processors/dpp2.py)**: This data transformation strategy first transforms 'numeric' features using [combined RobustImputer and RobustMissingIndicator](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/impute/base.py) followed by [QuantileExtremeValuesTransformer](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/base.py), 'categorical' features using [ThresholdOneHotEncoder](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/encoders.py). It merges all the generated features and applies [RobustPCA](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/decomposition/robust_pca.py) followed by [RobustStandardScaler](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/data.py). The
transformed data will be used to tune a *linear-learner* model. Here is the definition:

In [6]:
automl_interactive_runner.select_candidate({
    "data_transformer": {
        "name": "dpp2",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
            "volume_size_in_gb":  50
        },
        "transform_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
        "transforms_label": True,
        "transformed_data_format": "application/x-recordio-protobuf",
        "sparse_encoding": False
    },
    "algorithm": {
        "name": "linear-learner",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
    }
})

2021-02-11 04:07:02,386 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2021-02-11 04:07:02,387 INFO sagemaker.image_uris: Defaulting to only available Python version: py3
2021-02-11 04:07:02,398 INFO sagemaker.image_uris: Defaulting to only supported image scope: cpu.
2021-02-11 04:07:02,400 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2021-02-11 04:07:02,411 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2021-02-11 04:07:02,412 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2021-02-11 04:07:02,423 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.


**[dpp3-xgboost](Funders-USA-SEC-data-MVP-1-artifacts/generated_module/candidate_data_processors/dpp3.py)**: This data transformation strategy first transforms 'numeric' features using [RobustImputer (converts missing values to nan)](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/impute/base.py), 'categorical' features using [ThresholdOneHotEncoder](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/encoders.py). It merges all the generated features and applies [RobustStandardScaler](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/data.py). The
transformed data will be used to tune an *xgboost* model. Here is the definition:

In [7]:
automl_interactive_runner.select_candidate({
    "data_transformer": {
        "name": "dpp3",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
            "volume_size_in_gb":  50
        },
        "transform_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
        "transforms_label": True,
        "transformed_data_format": "application/x-recordio-protobuf",
        "sparse_encoding": True
    },
    "algorithm": {
        "name": "xgboost",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
    }
})

2021-02-11 04:07:03,449 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2021-02-11 04:07:03,451 INFO sagemaker.image_uris: Defaulting to only available Python version: py3
2021-02-11 04:07:03,462 INFO sagemaker.image_uris: Defaulting to only supported image scope: cpu.
2021-02-11 04:07:03,464 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2021-02-11 04:07:03,464 INFO sagemaker.image_uris: Defaulting to only available Python version: py3
2021-02-11 04:07:03,483 INFO sagemaker.image_uris: Defaulting to only supported image scope: cpu.
2021-02-11 04:07:03,485 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2021-02-11 04:07:03,486 INFO sagemaker.image_uris: Defaulting to only available Python version: py3
2021-02-11 04:07:03,495 INFO sagemaker.image_uris: Defaulting to only supported image scope: cpu.


**[dpp4-xgboost](Funders-USA-SEC-data-MVP-1-artifacts/generated_module/candidate_data_processors/dpp4.py)**: This data transformation strategy first transforms 'numeric' features using [RobustImputer (converts missing values to nan)](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/impute/base.py), 'categorical' features using [ThresholdOneHotEncoder](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/encoders.py). It merges all the generated features and applies [RobustStandardScaler](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/data.py). The
transformed data will be used to tune an *xgboost* model. Here is the definition:

In [8]:
automl_interactive_runner.select_candidate({
    "data_transformer": {
        "name": "dpp4",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
            "volume_size_in_gb":  50
        },
        "transform_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
        "transforms_label": True,
        "transformed_data_format": "application/x-recordio-protobuf",
        "sparse_encoding": True
    },
    "algorithm": {
        "name": "xgboost",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
    }
})

2021-02-11 04:07:04,043 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2021-02-11 04:07:04,044 INFO sagemaker.image_uris: Defaulting to only available Python version: py3
2021-02-11 04:07:04,055 INFO sagemaker.image_uris: Defaulting to only supported image scope: cpu.
2021-02-11 04:07:04,056 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2021-02-11 04:07:04,057 INFO sagemaker.image_uris: Defaulting to only available Python version: py3
2021-02-11 04:07:04,067 INFO sagemaker.image_uris: Defaulting to only supported image scope: cpu.
2021-02-11 04:07:04,068 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2021-02-11 04:07:04,069 INFO sagemaker.image_uris: Defaulting to only available Python version: py3
2021-02-11 04:07:04,078 INFO sagemaker.image_uris: Defaulting to only supported image scope: cpu.


**[dpp5-mlp](Funders-USA-SEC-data-MVP-1-artifacts/generated_module/candidate_data_processors/dpp5.py)**: This data transformation strategy transforms 'numeric' features using [RobustImputer](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/impute/base.py) followed by [RobustStandardScaler](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/data.py). The
transformed data will be used to tune an *mlp* model. Here is the definition:

In [9]:
automl_interactive_runner.select_candidate({
    "data_transformer": {
        "name": "dpp5",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
            "volume_size_in_gb":  50
        },
        "transform_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
        "transforms_label": True,
        "transformed_data_format": "text/csv",
        "sparse_encoding": False
    },
    "algorithm": {
        "name": "mlp",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
        "candidate_specific_static_hyperparameters": {
            "num_categorical_features": '0',
        }
    }
})

2021-02-11 04:07:04,381 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2021-02-11 04:07:04,382 INFO sagemaker.image_uris: Defaulting to only available Python version: py3
2021-02-11 04:07:04,390 INFO sagemaker.image_uris: Defaulting to only supported image scope: cpu.
2021-02-11 04:07:04,392 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2021-02-11 04:07:04,401 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2021-02-11 04:07:04,402 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2021-02-11 04:07:04,417 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.


### Selected Candidates

You have selected the following candidates (please run the cell below and click on the feature transformer links for details):

In [10]:
automl_interactive_runner.display_candidates()

| Candidate Name | Algorithm | Feature Transformer |
|---|---|---|
| dpp0-xgboost | xgboost | dpp0.py |
| dpp1-xgboost | xgboost | dpp1.py |
| dpp2-linear-learner | linear-learner | dpp2.py |
| dpp3-xgboost | xgboost | dpp3.py |
| dpp4-xgboost | xgboost | dpp4.py |
| dpp5-mlp | mlp | dpp5.py |


The feature engineering pipeline consists of the generated transformer modules and two SageMaker jobs:

1. Generated trainable data transformer Python modules like [dpp0.py](Funders-USA-SEC-data-MVP-1-artifacts/generated_module/candidate_data_processors/dpp0.py), which have been downloaded to the local file system
2. A **training** job to train the data transformers
3. A **batch transform** job to apply the trained transformation to the dataset to generate the algorithm-compatible data

The transformers and their training pipeline are built using the open-source **[sagemaker-scikit-learn-container][]** and **[sagemaker-scikit-learn-extension][]**.

[sagemaker-scikit-learn-container]: https://github.com/aws/sagemaker-scikit-learn-container
[sagemaker-scikit-learn-extension]: https://github.com/aws/sagemaker-scikit-learn-extension
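
The split into a training job and a batch transform job mirrors the usual fit-then-transform pattern: statistics are learned once from the training data and then applied unchanged to other data. The toy class below illustrates that pattern for a single numeric column. It is a simplified stand-in written for this example, not the `RobustImputer` or `RobustStandardScaler` implementation.

```python
import statistics


class MedianImputeAndScale:
    """Toy single-column transformer: fill gaps with the median, then standardize."""

    def fit(self, values):
        # Learn the fill value and scaling statistics from the training column.
        observed = [v for v in values if v is not None]
        self.median_ = statistics.median(observed)
        filled = [self.median_ if v is None else v for v in values]
        self.mean_ = statistics.mean(filled)
        std = statistics.pstdev(filled)
        self.scale_ = std if std > 0 else 1.0
        return self

    def transform(self, values):
        # Apply the learned statistics without re-estimating them.
        filled = [self.median_ if v is None else v for v in values]
        return [(v - self.mean_) / self.scale_ for v in filled]


train_column = [1.0, None, 3.0]
validation_column = [2.0, None, 5.0]

transformer = MedianImputeAndScale().fit(train_column)  # analogous to the training job
print(transformer.transform(validation_column))         # analogous to the batch transform job
```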

## Executing the Candidate Pipelines

Each candidate pipeline consists of two steps: feature transformation and algorithm training.
For efficiency, first execute the feature transformation step, which generates a featurized dataset on S3
for each pipeline.

After each featurized dataset is prepared, execute a multi-algorithm tuning job that runs tuning jobs
in parallel for each pipeline. This tuning job executes training jobs to find the best set of
hyperparameters for each pipeline, and also identifies the best-performing pipeline overall.

### Run Data Transformation Steps

Now you are ready to start executing all data transformation steps. The cell below may take some time to finish,
so feel free to grab a cup of coffee. To expedite the process, you can set `parallel_jobs` to a value of up to 10.
Please check your account limits, and request increases if needed, before raising the number of jobs run in parallel.

In [11]:
automl_interactive_runner.fit_data_transformers(parallel_jobs=2)

2021-02-11 04:07:07,192 INFO root: [Worker_1:dpp1-xgboost]Executing step: train_data_transformer
2021-02-11 04:07:07,195 INFO sagemaker.image_uris: Defaulting to the only supported framework/algorithm version: latest.
2021-02-11 04:07:07,207 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2021-02-11 04:07:07,529 INFO sagemaker: Creating training-job with name: Funders-US-notebook-run-11-04-05-35-dpp1-train-11-04-07-07

2021-02-11 04:07:07 Starting - Starting the training job
2021-02-11 04:07:11,194 INFO root: [Worker_0:dpp0-xgboost]Executing step: train_data_transformer
2021-02-11 04:07:11,195 INFO sagemaker.image_uris: Defaulting to the only supported framework/algorithm version: latest.
2021-02-11 04:07:11,207 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2021-02-11 04:07:11,346 INFO sagemaker: Creating training-job with name: Funders-US-notebook-run-11-04-05-35-dpp0-train-11-04-07-07

2021-02-11 04:07:11 Starting - Starting the training job
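
The `Worker_0` and `Worker_1` prefixes in the log above suggest a pool of workers that each process one candidate at a time. How `fit_data_transformers` schedules its work internally is not shown in this notebook, so the sketch below only illustrates the general idea of capping concurrency with `parallel_jobs`, using a stand-in function that sleeps instead of launching SageMaker jobs.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def run_transformer_job(candidate_name):
    """Stand-in for one train-then-transform job; it only sleeps briefly."""
    time.sleep(0.01)
    return candidate_name, "Completed"


def run_all(candidate_names, parallel_jobs=2):
    """Run one job per candidate with at most `parallel_jobs` running at once."""
    with ThreadPoolExecutor(max_workers=parallel_jobs) as pool:
        return dict(pool.map(run_transformer_job, candidate_names))


candidates = ["dpp0-xgboost", "dpp1-xgboost", "dpp2-linear-learner",
              "dpp3-xgboost", "dpp4-xgboost", "dpp5-mlp"]
print(run_all(candidates, parallel_jobs=2))
```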

### Multi Algorithm Hyperparameter Tuning

Now that the algorithm-compatible transformed datasets are ready, you can start the multi-algorithm model tuning job
to find the best predictive model. The following training job configuration for each
algorithm is auto-generated by the AutoML job as part of the recommendation.

<div class="alert alert-info"> 💡 <strong> Available Knobs</strong>

1. Hyperparameter ranges
2. Objective metrics
3. Recommended static algorithm hyperparameters.

Please refer to [XGBoost tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost-tuning.html) and [Linear learner tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner-tuning.html) for detailed explanations of the parameters.
</div>

The AutoML recommendation job has recommended the following objective metrics and static hyperparameters for
each algorithm and this problem type:

In [12]:
ALGORITHM_OBJECTIVE_METRICS = {
    'xgboost': 'validation:accuracy',
    'linear-learner': 'validation:binary_classification_accuracy',
    'mlp': 'validation:accuracy',
}

STATIC_HYPERPARAMETERS = {
    'xgboost': {
        'objective': 'binary:logistic',
        'save_model_on_termination': 'true',
    },
    'linear-learner': {
        'predictor_type': 'binary_classifier',
        'loss': 'logistic',
        'mini_batch_size': 800,
        'binary_classifier_model_selection_criteria': 'loss_function',
        'num_models': 1,
    },
    'mlp': {
        'problem_type': 'binary_classification',
        'positive_example_weight_mult': 'auto',
        'ml_application': 'mlp',
        'use_batchnorm': 'true',
        'activation': 'relu',
        'warmup_epochs': 10,
        'eval_metric': 'accuracy',
    },
}
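
If you edit these dictionaries, a quick consistency check before launching a tuning job can catch an algorithm that is missing from one of them. The sketch below uses a hypothetical helper, `missing_entries`, and trimmed copies of the dictionaries so that it runs on its own.

```python
def missing_entries(algorithms, objective_metrics, static_hyperparameters):
    """Map each algorithm to the names of the dictionaries that lack an entry for it."""
    problems = {}
    for algorithm in algorithms:
        missing = []
        if algorithm not in objective_metrics:
            missing.append("objective_metrics")
        if algorithm not in static_hyperparameters:
            missing.append("static_hyperparameters")
        if missing:
            problems[algorithm] = missing
    return problems


# Deliberately incomplete copies, to show what the check reports.
metrics = {"xgboost": "validation:accuracy", "mlp": "validation:accuracy"}
static = {"xgboost": {"objective": "binary:logistic"}}

print(missing_entries(["xgboost", "mlp", "linear-learner"], metrics, static))
```

An empty result means every listed algorithm has both an objective metric and static hyperparameters.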

The following search ranges for the tunable hyperparameters are recommended for the multi-algo tuning job:

In [13]:
from sagemaker.parameter import CategoricalParameter, ContinuousParameter, IntegerParameter

ALGORITHM_TUNABLE_HYPERPARAMETER_RANGES = {
    'xgboost': {
        'num_round': IntegerParameter(2, 1024, scaling_type='Logarithmic'),
        'max_depth': IntegerParameter(2, 8, scaling_type='Logarithmic'),
        'eta': ContinuousParameter(1e-3, 1.0, scaling_type='Logarithmic'),
        'gamma': ContinuousParameter(1e-6, 64.0, scaling_type='Logarithmic'),
        'min_child_weight': ContinuousParameter(1e-6, 32.0, scaling_type='Logarithmic'),
        'subsample': ContinuousParameter(0.5, 1.0, scaling_type='Linear'),
        'colsample_bytree': ContinuousParameter(0.3, 1.0, scaling_type='Linear'),
        'lambda': ContinuousParameter(1e-6, 2.0, scaling_type='Logarithmic'),
        'alpha': ContinuousParameter(1e-6, 2.0, scaling_type='Logarithmic'),
    },
    'linear-learner': {
        'wd': ContinuousParameter(1e-7, 1.0, scaling_type='Logarithmic'),
        'l1': ContinuousParameter(1e-7, 1.0, scaling_type='Logarithmic'),
        'learning_rate': ContinuousParameter(1e-5, 1.0, scaling_type='Logarithmic'),
    },
    'mlp': {
        'mini_batch_size': IntegerParameter(128, 512, scaling_type='Linear'),
        'learning_rate': ContinuousParameter(1e-6, 1e-2, scaling_type='Logarithmic'),
        'weight_decay': ContinuousParameter(1e-12, 1e-2, scaling_type='Logarithmic'),
        'dropout_prob': ContinuousParameter(0.25, 0.5, scaling_type='Linear'),
        'embedding_size_factor': ContinuousParameter(0.65, 0.95, scaling_type='Linear'),
        'network_type': CategoricalParameter(['feedforward', 'widedeep']),
        'layers': CategoricalParameter(['256', '50, 25', '100, 50', '200, 100', '256, 128', '300, 150', '200, 100, 50']),
    },
}
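
Most ranges above use `scaling_type='Logarithmic'`, which matters when a range spans several orders of magnitude, such as `eta` from 1e-3 to 1.0. The sketch below contrasts uniform draws on a linear scale with uniform draws in log space. It only illustrates the effect of the scale; it uses plain random sampling and is not how the Bayesian strategy picks values.

```python
import math
import random


def sample_linear(low, high, rng):
    """Draw uniformly between low and high."""
    return rng.uniform(low, high)


def sample_logarithmic(low, high, rng):
    """Draw uniformly in log space, so each decade is equally likely."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
linear = [sample_linear(1e-3, 1.0, rng) for _ in range(10000)]
logarithmic = [sample_logarithmic(1e-3, 1.0, rng) for _ in range(10000)]

# Share of draws below 0.01, the lowest of the three decades in the range.
print(sum(x < 0.01 for x in linear) / len(linear))
print(sum(x < 0.01 for x in logarithmic) / len(logarithmic))
```

On the linear scale only about 1% of the draws land below 0.01, whereas on the logarithmic scale roughly a third do, so small values get explored far more often.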

#### Prepare Multi-Algorithm Tuner Input

To use the multi-algorithm HPO tuner, prepare its inputs and parameters. Prepare dictionaries keyed by the name of each trained pipeline candidate, whose values are respectively:

1. Estimators for the recommended algorithm
2. Hyperparameters search ranges
3. Objective metrics
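
The recommendations earlier are keyed by algorithm, while `HyperparameterTuner.create` expects dictionaries keyed by estimator name (for example `objective_metric_name_dict` and `hyperparameter_ranges_dict`). What `prepare_multi_algo_parameters` does internally is not shown here; the sketch below is only an assumption about the re-keying step, using a hypothetical helper and a subset of the candidates.

```python
def per_candidate(candidates, per_algorithm_values):
    """Map each candidate name to the value registered for its algorithm."""
    return {name: per_algorithm_values[algorithm]
            for name, algorithm in candidates.items()}


# Candidate name -> algorithm, for three of the selected candidates.
candidates = {
    "dpp0-xgboost": "xgboost",
    "dpp2-linear-learner": "linear-learner",
    "dpp5-mlp": "mlp",
}
objective_metrics = {
    "xgboost": "validation:accuracy",
    "linear-learner": "validation:binary_classification_accuracy",
    "mlp": "validation:accuracy",
}

objective_metric_name_dict = per_candidate(candidates, objective_metrics)
print(objective_metric_name_dict["dpp2-linear-learner"])
# validation:binary_classification_accuracy
```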

In [14]:
multi_algo_tuning_parameters = automl_interactive_runner.prepare_multi_algo_parameters(
    objective_metrics=ALGORITHM_OBJECTIVE_METRICS,
    static_hyperparameters=STATIC_HYPERPARAMETERS,
    hyperparameters_search_ranges=ALGORITHM_TUNABLE_HYPERPARAMETER_RANGES)

Below you prepare the input data for the multi-algo tuner:

In [15]:
multi_algo_tuning_inputs = automl_interactive_runner.prepare_multi_algo_inputs()

#### Create Multi-Algorithm Tuner

With the recommended hyperparameter ranges and the transformed datasets, create a multi-algorithm model tuning job
that coordinates hyperparameter optimization across the different algorithms and feature processing strategies.

<div class="alert alert-info"> 💡 <strong> Available Knobs</strong>

1. Tuner strategy: [Bayesian](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Bayesian_optimization), [Random Search](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Random_search)
2. Objective type: `Minimize`, `Maximize`, see [optimization](https://en.wikipedia.org/wiki/Mathematical_optimization)
3. Max job size: the maximum number of training jobs that HPO launches to run experiments. The default value is **250**,
    which is the default of the managed flow.
4. Parallelism: the number of jobs executed in parallel. A higher value expedites the tuning process.
    Please check your account limits, and request increases if needed, before raising the number of jobs run in parallel.
5. Tuning job name: use a different name if you re-run this cell after applying customizations.
</div>

In [16]:
from sagemaker.tuner import HyperparameterTuner

base_tuning_job_name = "{}-tuning".format(AUTOML_LOCAL_RUN_CONFIG.local_automl_job_name)

tuner = HyperparameterTuner.create(
    base_tuning_job_name=base_tuning_job_name,
    strategy='Bayesian',
    objective_type='Maximize',
    max_parallel_jobs=2,
    max_jobs=250,
    **multi_algo_tuning_parameters,
)
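
To get a feel for how `max_jobs` and `max_parallel_jobs` interact, the following back-of-the-envelope sketch counts the sequential rounds the tuner needs and turns that into a rough duration. The 240 seconds per round is an assumption (training time plus instance start-up), not a measured value, and both helper names are hypothetical.

```python
import math


def tuning_rounds(max_jobs, max_parallel_jobs):
    """Sequential rounds needed when jobs run max_parallel_jobs at a time."""
    return math.ceil(max_jobs / max_parallel_jobs)


def estimated_hours(max_jobs, max_parallel_jobs, seconds_per_round):
    """Rough wall-clock estimate; seconds_per_round is an assumed average."""
    return tuning_rounds(max_jobs, max_parallel_jobs) * seconds_per_round / 3600


# Values from the cell above: 250 jobs, 2 at a time.
rounds = tuning_rounds(250, 2)
# 240 seconds per round is an assumption, not a measurement.
hours = estimated_hours(250, 2, seconds_per_round=240)
print(rounds, round(hours, 1))  # prints: 125 8.3
```

Raising the parallelism shortens the run, but with the Bayesian strategy each new job learns from the ones already finished, so very high parallelism may give the search less to learn from per round.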

#### Run Multi-Algorithm Tuning

Now you are ready to start running the **Multi-Algo Tuning** job. After the job finishes, store the tuning job name, which you use to select models in the next section.
The tuning process takes some time; track its progress in the Amazon SageMaker hyperparameter tuning jobs console.

In [None]:
from IPython.display import display, Markdown

# Run tuning
tuner.fit(inputs=multi_algo_tuning_inputs, include_cls_metadata=None)
tuning_job_name = tuner.latest_tuning_job.name

display(
    Markdown(f"Tuning Job {tuning_job_name} started, please track the progress from [here](https://{AUTOML_LOCAL_RUN_CONFIG.region}.console.aws.amazon.com/sagemaker/home?region={AUTOML_LOCAL_RUN_CONFIG.region}#/hyper-tuning-jobs/{tuning_job_name})"))

# Wait for tuning job to finish
tuner.wait()

2021-02-11 04:32:18,325 INFO sagemaker.image_uris: Defaulting to the only supported framework/algorithm version: latest.
2021-02-11 04:32:18,334 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2021-02-11 04:32:18,335 INFO sagemaker.image_uris: Defaulting to the only supported framework/algorithm version: latest.
2021-02-11 04:32:18,345 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2021-02-11 04:32:18,346 INFO sagemaker.image_uris: Defaulting to the only supported framework/algorithm version: latest.
2021-02-11 04:32:18,354 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2021-02-11 04:32:18,355 INFO sagemaker.image_uris: Defaulting to the only supported framework/algorithm version: latest.
2021-02-11 04:32:18,363 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2021-02-11 04:32:18,364 INFO sagemaker.image_uris: Defaulting to the only supported framework/algorithm version: latest.
2021-02-11 04:32:18,372

## Model Selection and Deployment

This section guides you through the model selection process. Afterward, you construct an inference pipeline
on Amazon SageMaker to host the best candidate.

Because you executed the feature transformation and algorithm training in two separate steps, you now need to manually
link each trained model with the feature transformer that it is associated with. When running a regular Amazon
SageMaker Autopilot job, this will automatically be done for you.
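
The link between a trained model and its feature transformer is carried by the candidate name. In this run the names follow a `<data-processor>-<algorithm>` pattern, for example `dpp5-mlp` or `dpp1-xgboost`. Here is a minimal sketch of recovering the two parts, assuming that pattern holds for every candidate. The helper name is hypothetical and not part of `sagemaker_automl`.

```python
def split_candidate_name(definition_name):
    """Split '<data-processor>-<algorithm>' at the first hyphen."""
    data_processor, sep, algorithm = definition_name.partition("-")
    if not sep or not data_processor or not algorithm:
        raise ValueError(f"unexpected candidate name: {definition_name!r}")
    return data_processor, algorithm


print(split_candidate_name("dpp5-mlp"))
print(split_candidate_name("dpp1-xgboost"))
```

Splitting only at the first hyphen keeps algorithm names that themselves contain a hyphen in one piece.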

### Tuning Job Result Overview

The performance of each candidate pipeline can be viewed as a Pandas dataframe. For more interactive usage, please
refer to the [model tuning monitor](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-monitor.html).

In [37]:
from pprint import pprint
from sagemaker.analytics import HyperparameterTuningJobAnalytics

SAGEMAKER_SESSION = AUTOML_LOCAL_RUN_CONFIG.sagemaker_session
SAGEMAKER_ROLE = AUTOML_LOCAL_RUN_CONFIG.role

tuner_analytics = HyperparameterTuningJobAnalytics(
    tuner.latest_tuning_job.name, sagemaker_session=SAGEMAKER_SESSION)

df_tuning_job_analytics = tuner_analytics.dataframe()

# Sort the tuning job analytics by the final metrics value
df_tuning_job_analytics.sort_values(
    by=['FinalObjectiveValue'],
    inplace=True,
    ascending=(tuner.objective_type != "Maximize"))

# Show detailed analytics for the top 5 models
import pandas as pd
pd.set_option('display.max_columns', 60)
df_tuning_job_analytics.head(5)

Unnamed: 0,dropout_prob,embedding_size_factor,layers,learning_rate,mini_batch_size,network_type,weight_decay,TrainingJobName,TrainingJobStatus,FinalObjectiveValue,TrainingStartTime,TrainingEndTime,TrainingElapsedTimeSeconds,TrainingJobDefinitionName,alpha,colsample_bytree,eta,gamma,lambda,max_depth,min_child_weight,num_round,subsample,l1,wd
243,0.478083,0.934566,"50, 25",0.000303,397.0,widedeep,1.937422e-06,Funders-US-notebook--210211-0432-007-8fd52f79,Completed,0.579412,2021-02-11 04:44:42+00:00,2021-02-11 04:46:17+00:00,95.0,dpp5-mlp,,,,,,,,,,,
217,,,,,,,,Funders-US-notebook--210211-0432-033-80d3bacd,Completed,0.57353,2021-02-11 05:31:18+00:00,2021-02-11 05:32:49+00:00,91.0,dpp1-xgboost,0.026206,0.849549,0.008609,2.366705,0.000999,6.0,0.766896,73.0,0.66897,,
3,0.482338,0.887138,"50, 25",0.000956,281.0,widedeep,3.838679e-06,Funders-US-notebook--210211-0432-247-547e3f52,Completed,0.573529,2021-02-11 12:58:34+00:00,2021-02-11 13:00:09+00:00,95.0,dpp5-mlp,,,,,,,,,,,
209,0.48018,0.90237,"50, 25",0.000338,388.0,widedeep,1.380099e-07,Funders-US-notebook--210211-0432-041-70acfa5f,Completed,0.573529,2021-02-11 05:48:01+00:00,2021-02-11 05:49:36+00:00,95.0,dpp5-mlp,,,,,,,,,,,
143,0.459914,0.917667,"50, 25",0.000589,455.0,widedeep,6.073554e-07,Funders-US-notebook--210211-0432-107-413444cb,Completed,0.567647,2021-02-11 08:08:37+00:00,2021-02-11 08:10:34+00:00,117.0,dpp5-mlp,,,,,,,,,,,


The best training job can be selected as follows:

<div class="alert alert-info"> 💡 <strong>Tips: </strong>
You can select an alternative job by taking a value from the `TrainingJobName` column above and assigning it to `best_training_job` below.
</div>

In [27]:
attached_tuner = HyperparameterTuner.attach(tuner.latest_tuning_job.name, sagemaker_session=SAGEMAKER_SESSION)
best_training_job = attached_tuner.best_training_job()

print("Best Multi Algorithm HPO training job name is {}".format(best_training_job))

Best Multi Algorithm HPO training job name is Funders-US-notebook--210211-0432-007-8fd52f79
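
To act on the tip above, you may want the best job of one particular candidate rather than the overall winner. The sketch below does this on three rows copied from the results table. `best_job_for` is a hypothetical helper, and plain dictionaries stand in for the dataframe.

```python
# Rows copied from the tuning results shown earlier (three of the columns).
rows = [
    {"TrainingJobName": "Funders-US-notebook--210211-0432-007-8fd52f79",
     "TrainingJobDefinitionName": "dpp5-mlp", "FinalObjectiveValue": 0.579412},
    {"TrainingJobName": "Funders-US-notebook--210211-0432-033-80d3bacd",
     "TrainingJobDefinitionName": "dpp1-xgboost", "FinalObjectiveValue": 0.57353},
    {"TrainingJobName": "Funders-US-notebook--210211-0432-247-547e3f52",
     "TrainingJobDefinitionName": "dpp5-mlp", "FinalObjectiveValue": 0.573529},
]


def best_job_for(rows, definition_name, maximize=True):
    """Return the name of the best job trained for one candidate definition."""
    matching = [r for r in rows if r["TrainingJobDefinitionName"] == definition_name]
    if not matching:
        raise KeyError(definition_name)
    pick = max if maximize else min
    return pick(matching, key=lambda r: r["FinalObjectiveValue"])["TrainingJobName"]


alternative_job = best_job_for(rows, "dpp1-xgboost")
print(alternative_job)
```

Assigning a name chosen this way to `best_training_job` makes the following cells build the pipeline around that candidate instead of the overall best one.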


### Linking Best Training Job with Feature Pipelines

Finally, deploy the best training job to Amazon SageMaker along with its companion feature engineering models.
At the end of the section, you get an endpoint that's ready to serve online inference or start batch transform jobs!

Deploy a [PipelineModel](https://sagemaker.readthedocs.io/en/stable/pipeline.html) that chains the following containers:

1. Data Transformation Container: a container built from the model we selected and trained during the data transformer sections
2. Algorithm Container: a container built from the trained model we selected above from the best HPO training job.
3. Inverse Label Transformer Container: a container that converts numerical intermediate prediction value back to non-numerical label value.
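
Conceptually, the endpoint passes each request through these containers in order, with the output of one becoming the input of the next. The toy sketch below mimics that flow with plain functions. The scaling constants, the 0.5 threshold, and the label names are made up for illustration and have nothing to do with the trained models.

```python
def transform_features(raw_row):
    """Stand-in for the data transformation container (toy standardization)."""
    return [(value - 10.0) / 5.0 for value in raw_row]


def predict(features):
    """Stand-in for the algorithm container: returns an encoded class, 0 or 1."""
    score = sum(features) / len(features)
    return 1 if score > 0.5 else 0


def inverse_label(encoded, labels=("no", "yes")):
    """Stand-in for the inverse label transformer: encoded class to label."""
    return labels[encoded]


def run_pipeline(raw_row, steps):
    """Apply each step in order, feeding every output into the next step."""
    value = raw_row
    for step in steps:
        value = step(value)
    return value


steps = [transform_features, predict, inverse_label]
print(run_pipeline([20.0, 15.0, 25.0], steps))
```

The order of `steps` matters in the same way the order of `models` matters in a `PipelineModel`: the transformer must run before the algorithm, and the label decoder must run last.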

Get both the best data transformation model and the algorithm model from the best training job, and create a pipeline model:

In [28]:
from sagemaker.estimator import Estimator
from sagemaker import PipelineModel
from sagemaker_automl import select_inference_output

# Get a data transformation model from chosen candidate
best_candidate = automl_interactive_runner.choose_candidate(df_tuning_job_analytics, best_training_job)
best_data_transformer_model = best_candidate.get_data_transformer_model(role=SAGEMAKER_ROLE, sagemaker_session=SAGEMAKER_SESSION)

# Our first data transformation container will always return recordio-protobuf format
best_data_transformer_model.env["SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT"] = 'application/x-recordio-protobuf'
# Add environment variable for sparse encoding
if best_candidate.data_transformer_step.sparse_encoding:
    best_data_transformer_model.env["AUTOML_SPARSE_ENCODE_RECORDIO_PROTOBUF"] = '1'

# Get an algo model from the chosen training job of the candidate
algo_estimator = Estimator.attach(best_training_job)
best_algo_model = algo_estimator.create_model(**best_candidate.algo_step.get_inference_container_config())

# Final pipeline model is composed of data transformation models and algo model and an
# inverse label transform model if we need to transform the intermediates back to non-numerical value
model_containers = [best_data_transformer_model, best_algo_model]
if best_candidate.transforms_label:
    model_containers.append(best_candidate.get_data_transformer_model(
        transform_mode="inverse-label-transform",
        role=SAGEMAKER_ROLE,
        sagemaker_session=SAGEMAKER_SESSION))

# This model can emit the response keys ['predicted_label', 'probability', 'labels', 'probabilities']. To make the model emit one or more
# of them, pass the keys to the `output_keys` keyword argument of the select_inference_output method.

model_containers = select_inference_output("BinaryClassification", model_containers, output_keys=['predicted_label'])


pipeline_model = PipelineModel(
    name="AutoML-{}".format(AUTOML_LOCAL_RUN_CONFIG.local_automl_job_name),
    role=SAGEMAKER_ROLE,
    models=model_containers,
    vpc_config=AUTOML_LOCAL_RUN_CONFIG.vpc_config)

2021-02-11 20:54:18,578 INFO root: Chosen Data Processing pipeline candidate name is dpp5-mlp

2021-02-11 04:28:04 Starting - Preparing the instances for training
2021-02-11 04:28:04 Downloading - Downloading input data
2021-02-11 04:28:04 Training - Training image download completed. Training in progress.
2021-02-11 04:28:04 Uploading - Uploading generated training model
2021-02-11 04:28:04 Completed - Training job completed

2021-02-11 04:46:17 Starting - Preparing the instances for training
2021-02-11 04:46:17 Downloading - Downloading input data
2021-02-11 04:46:17 Training - Training image download completed. Training in progress.
2021-02-11 04:46:17 Uploading - Uploading generated training model
2021-02-11 04:46:17 Completed - Training job completed

2021-02-11 04:28:04 Starting - Preparing the instances for training
2021-02-11 04:28:04 Downloading - Downloading input data
2021-02-11 04:28:04 Training - Training image download completed. Training in progress.
2021-02-11 04:28:04 

### Showing Model Predictions Next to True Labels

In [51]:
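# Load the transformed validation set into a dataframe.
# Note: this file appears to have no header row, so read_csv promotes the first
# record to column names (see the preview below); pass header=None to avoid that.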
path = 's3://sagemaker-us-east-1-570124035543/export-flow-04-21-57-21-dd5ae906/machine-learning-output/autopilot-output/Funders-USA-SEC-data-MVP-1/Funders-US-notebook-run-11-04-05-35/transformed-data/dpp0/csv/validation/chunk_0.csv.out'
val_df = pd.read_csv(path)
# df_0to5 = pd.read_csv(path)
# path = 's3://sagemaker-us-east-1-570124035543/export-flow-04-21-57-21-dd5ae906/machine-learning-output/autopilot-output/Funders-USA-SEC-data-MVP-1/Funders-US-notebook-run-11-04-05-35/transformed-data/dpp0/csv/validation/chunk_1.csv.out'
# df_6to10 = pd.read_csv(path)
# val_df = df_0to5.append(df_6to10)
display(val_df.head(5))


predictor = pipeline_model.predictor_cls()
# Get predictions
# model_predictions = predictor.predict(val_df)
# display(model_predictions)

# val_df.insert(1,'model_predictions', model_predictions)
# ?algo_estimator
# ?best_algo_model
?pipeline_model
dir(pipeline_model)


Unnamed: 0,1.0,-0.5519929984313154,-1.1190904348376265,-0.6571774950022166,-0.17057581702560418,-0.13056050055005414,-0.33787003881689676,-0.2581718569154313,0.0,0.0.1,-0.2884414630948068,-0.24803741794451586,-0.3110545369120166,0.053473375144112066,0.31297971825027526,0.19748155461069614,-0.29817118085353106
0,0.0,0.794424,1.133006,-0.131178,-0.117575,0.103318,-0.330313,0.199294,0.000868,0.275212,0.935755,1.032625,-0.311055,0.0,-1.326807,-1.354797,-0.298171
1,1.0,1.746532,0.969302,0.570155,0.145832,0.111127,0.296923,0.53093,0.0,0.0,0.061946,0.045306,-0.311055,0.0,0.529368,0.457061,-0.298171
2,0.0,0.553992,1.133006,-0.481844,-0.170576,-0.133179,-0.33787,-0.267306,0.0,0.0,-0.288441,-0.248037,-0.311055,0.0,0.304929,0.240386,-0.298171
3,0.0,-0.45582,-1.252392,-0.832511,-0.163509,-0.133179,-0.281476,-0.267306,0.0,0.0,-0.288441,-0.248037,-0.311055,0.0,0.31298,0.240386,-0.298171
4,1.0,-0.551993,-1.11909,-0.481844,-0.088022,-0.13186,0.250008,-0.263898,0.0,0.0,-0.040165,-0.227759,-0.277925,0.0,0.156266,0.185507,-0.298171


['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_create_sagemaker_pipeline_model',
 'delete_model',
 'deploy',
 'enable_network_isolation',
 'endpoint_name',
 'models',
 'name',
 'pipeline_container_def',
 'predictor_cls',
 'role',
 'sagemaker_session',
 'transformer',
 'vpc_config']

Type:           PipelineModel
String form:    <sagemaker.pipeline.PipelineModel object at 0x7f5799ba16d0>
File:           /opt/conda/lib/python3.7/site-packages/sagemaker/pipeline.py
Docstring:
A pipeline of SageMaker `Model` instances.

This pipeline can be deployed as an `Endpoint` on SageMaker.
Init docstring:
Initialize a SageMaker `Model` instance.

The `Model` can be used to build an Inference Pipeline comprising of
multiple model containers.

Args:
    models (list[sagemaker.Model]): For using multiple containers to
        build an inference pipeline, you can pass a list of
        ``sagemaker.Model`` objects in the order you want the inference
        to happen.
    role (str): An AWS IAM role (either name or full ARN). The Amazon
        SageMaker training jobs and APIs that create Amazon SageMaker
        endpoints use this role to access training data and model
        artifacts. After the endpoint is created, the 
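
The comparison this section is working toward can be sketched without the endpoint: once predictions are available, place them beside the true labels and measure agreement. The sketch assumes the first column of the transformed validation data holds the encoded target (the 0.0/1.0 values in the preview suggest so). The predictions listed here are invented placeholders, not model output.

```python
# True labels: first column of the five previewed validation rows.
true_labels = [0.0, 1.0, 0.0, 0.0, 1.0]
# Placeholder predictions, invented for illustration only.
model_predictions = [0.0, 1.0, 1.0, 0.0, 1.0]


def side_by_side(true_labels, predictions):
    """Pair each true label with its prediction and flag whether they agree."""
    if len(true_labels) != len(predictions):
        raise ValueError("label and prediction counts differ")
    return [(t, p, t == p) for t, p in zip(true_labels, predictions)]


def accuracy(true_labels, predictions):
    """Fraction of rows where the prediction equals the true label."""
    pairs = side_by_side(true_labels, predictions)
    return sum(agree for _, _, agree in pairs) / len(pairs)


for true, predicted, agree in side_by_side(true_labels, model_predictions):
    print(true, predicted, "ok" if agree else "miss")
print("accuracy:", accuracy(true_labels, model_predictions))
```

This is the same quantity as the "ACCURACY" objective the tuning job maximizes, computed here on a handful of rows.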

### Deploying Best Pipeline

<div class="alert alert-info"> 💡 <strong> Available Knobs</strong>

1. You can customize the initial instance count and instance type used to deploy this model.
2. Endpoint name can be changed to avoid conflict with existing endpoints.

</div>

Finally, deploy the model to SageMaker to make it functional.

In [None]:
pipeline_model.deploy(initial_instance_count=1,
                      instance_type='ml.m5.2xlarge',
                      endpoint_name=pipeline_model.name,
                      wait=True)

Congratulations! You can now visit the SageMaker
[endpoint console page](https://us-east-1.console.aws.amazon.com/sagemaker/home?region=us-east-1#/endpoints) to find the deployed endpoint (it takes a few minutes to be in service).
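
Once the endpoint is in service, it can be called through the SageMaker runtime API. The following is a sketch, assuming the pipeline accepts `text/csv` input with the feature columns in training order and no target column. The feature values are invented, and the endpoint name is a placeholder you must replace with your own.

```python
import csv
import io


def to_csv_payload(rows):
    """Serialize feature rows to CSV text without a header line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def invoke(endpoint_name, payload, region="us-east-1"):
    """Send CSV text to the endpoint and return the response body as text."""
    import boto3  # imported here so the payload helper works without boto3

    runtime = boto3.client("sagemaker-runtime", region_name=region)
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Accept="text/csv",
        Body=payload,
    )
    return response["Body"].read().decode("utf-8")


# Invented values; a real request needs every feature column of the dataset.
payload = to_csv_payload([["A", 12, 3.5], ["B", 7, 0.25]])
print(payload)
# predictions = invoke("<your-endpoint-name>", payload)
```

The call to `invoke` is left commented out because it needs AWS credentials and a running endpoint, and it incurs charges.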

<div class="alert alert-warning">
    <strong>To rerun this notebook, delete or change the name of your endpoint!</strong> <br>
If you rerun this notebook, you'll run into an error on the last step because the endpoint already exists. You can either delete the endpoint from the endpoint console page or you can change the <code>endpoint_name</code> in the previous code block.
</div>
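
If you prefer to clean up programmatically, a sketch along these lines may help. It assumes the endpoint configuration shares the endpoint's name, and it takes the client as an argument so the logic can be tried with a stand-in before touching a real account. `delete_endpoint_resources` and `RecordingClient` are hypothetical names.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint and the endpoint config assumed to share its name."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)


class RecordingClient:
    """Stand-in for boto3.client('sagemaker') that only records the calls made."""

    def __init__(self):
        self.calls = []

    def delete_endpoint(self, EndpointName):
        self.calls.append(("delete_endpoint", EndpointName))

    def delete_endpoint_config(self, EndpointConfigName):
        self.calls.append(("delete_endpoint_config", EndpointConfigName))


dry_run = RecordingClient()
delete_endpoint_resources(dry_run, "AutoML-example")
print(dry_run.calls)

# Against a real account (this deletes resources):
# import boto3
# delete_endpoint_resources(boto3.client("sagemaker"), pipeline_model.name)
```

Deleting the endpoint also stops the charges for the hosting instance, which otherwise keep accruing while the endpoint is in service.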