# AutoGluon Tabular with SageMaker

[AutoGluon](https://github.com/awslabs/autogluon) automates machine learning tasks, enabling you to easily achieve strong predictive performance in your applications. With just a few lines of code, you can train and deploy high-accuracy deep learning models on tabular, image, and text data.
This notebook shows how to use AutoGluon-Tabular with Amazon SageMaker by creating custom containers.

## Prerequisites

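If using a SageMaker hosted notebook, select kernel `conda_mxnet_p36`.

The code below targets version 1.x of the SageMaker Python SDK (for example, it uses `RealTimePredictor` and `train_instance_type`, which were renamed in version 2).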

In [None]:
# Make sure docker compose is set up properly for local mode
!./setup.sh

In [None]:
# Imports
import os
import boto3
import sagemaker
from time import sleep
from collections import Counter
import pandas as pd
from sagemaker import get_execution_role, local, Model, utils, fw_utils, s3
from sagemaker.estimator import Estimator
from sagemaker.predictor import RealTimePredictor, csv_serializer, StringDeserializer
from sklearn.metrics import accuracy_score, classification_report
from IPython.core.display import display, HTML
from IPython.core.interactiveshell import InteractiveShell

# Print settings
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 10)

# Account/s3 setup
session = sagemaker.Session()
local_session = local.LocalSession()
bucket = session.default_bucket()
prefix = 'sagemaker/autogluon-tabular'
region = session.boto_region_name
role = get_execution_role()
client = session.boto_session.client(
    "sts", region_name=region, endpoint_url=utils.sts_regional_endpoint(region)
)
account = client.get_caller_identity()['Account']
ecr_uri_prefix = utils.get_ecr_image_uri_prefix(account, region)
registry_id = fw_utils._registry_id(region, 'mxnet', 'py3', account, '1.6.0')
registry_uri = utils.get_ecr_image_uri_prefix(registry_id, region)

### Build docker images

First, build the AutoGluon package to copy into the Docker image.

In [None]:
if not os.path.exists('package'):
    !pip install PrettyTable -t package
    !pip install bokeh -t package
    !pip install --pre autogluon -t package
    !pip install numpy==1.16.1 -t package
    !pip install --upgrade boto3 -t package
    !pip install --upgrade matplotlib -t package

Now build the training and inference images and push them to ECR.

In [None]:
training_algorithm_name = 'autogluon-sagemaker-training'
inference_algorithm_name = 'autogluon-sagemaker-inference'

In [None]:
!./container-training/build_push_training.sh {account} {region} {training_algorithm_name} {ecr_uri_prefix} {registry_id} {registry_uri}
!./container-inference/build_push_inference.sh {account} {region} {inference_algorithm_name} {ecr_uri_prefix} {registry_id} {registry_uri}
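
For reference, the image URIs used later in this notebook follow the standard ECR pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. The sketch below assembles such a URI by hand. The helper name `ecr_image_uri` and the account ID are hypothetical, and the domain suffix assumes a standard (non-China) AWS region.

```python
def ecr_image_uri(account: str, region: str, repository: str, tag: str = "latest") -> str:
    """Assemble an ECR image URI (illustrative helper, not part of the SageMaker SDK)."""
    # Assumption: standard AWS partition; China regions use amazonaws.com.cn instead.
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Placeholder account ID, for illustration only.
example_uri = ecr_image_uri("123456789012", "us-west-2", "autogluon-sagemaker-training")
print(example_uri)
```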

### Get the data

In this example we'll use the direct-marketing dataset to build a binary classification model that predicts whether customers will accept or decline a marketing offer.  
First we'll download the data and split it into train and test sets. AutoGluon does not require a separate validation set (it uses bagged k-fold cross-validation).

In [None]:
# Download and unzip the data
!aws s3 cp --region {region} s3://sagemaker-sample-data-{region}/autopilot/direct_marketing/bank-additional.zip .
!unzip -qq -o bank-additional.zip
!rm bank-additional.zip

local_data_path = './bank-additional/bank-additional-full.csv'
data = pd.read_csv(local_data_path)

# Split train/test data
train = data.sample(frac=0.7, random_state=42)
test = data.drop(train.index)

# Split test X/y
label = 'y'
y_test = test[label]
X_test = test.drop(columns=[label])
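
The `sample`/`drop` split above relies on the DataFrame index being unique; otherwise `drop(train.index)` could remove more rows than intended. A minimal sanity check on a small made-up frame (the toy columns are placeholders, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the marketing data, for illustration only.
toy = pd.DataFrame({"age": range(10), "y": ["no", "yes"] * 5})

toy_train = toy.sample(frac=0.7, random_state=42)
toy_test = toy.drop(toy_train.index)

# The split is only clean when the index has no duplicates.
assert toy.index.is_unique
# Together the two parts cover every row exactly once.
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train.index.intersection(toy_test.index).empty
print(len(toy_train), len(toy_test))
```

The same three assertions could be run against `data`, `train`, and `test` after the real split.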

##### Check the data

In [None]:
train.head(3)
train.shape

test.head(3)
test.shape

X_test.head(3)
X_test.shape

Upload the data to S3.

In [None]:
train_file = 'train.csv'
train.to_csv(train_file, index=False)
train_s3_path = session.upload_data(train_file, key_prefix='{}/data'.format(prefix))

test_file = 'test.csv'
test.to_csv(test_file, index=False)
test_s3_path = session.upload_data(test_file, key_prefix='{}/data'.format(prefix))

X_test_file = 'X_test.csv'
X_test.to_csv(X_test_file, index=False)
X_test_s3_path = session.upload_data(X_test_file, key_prefix='{}/data'.format(prefix))

## Train

The minimum requirement for hyperparameters is a target label.

In [None]:
hyperparameters = {'label': 'y'}

##### (Optional) hyperparameters can be passed to the `autogluon.task.TabularPrediction.fit` function.  

The hyperparameters below come from the example [Predicting Columns in a Table - In Depth](https://autogluon.mxnet.io/tutorials/tabular_prediction/tabular-indepth.html#model-ensembling-with-stacking-bagging). See [fit parameters](https://autogluon.mxnet.io/api/autogluon.task.html?highlight=eval_metric#autogluon.task.TabularPrediction.fit) for further information.

This more in-depth example from that tutorial shows how to provide hyperparameter ranges and additional settings:

```python
nn_options = {
    'num_epochs': '10',
    'learning_rate': "ag.space.Real(1e-4, 1e-2, default=5e-4, log=True)",
    'activation': "ag.space.Categorical('relu', 'softrelu', 'tanh')",
    'layers': "ag.space.Categorical([100],[1000],[200,100],[300,200,100])",
    'dropout_prob': "ag.space.Real(0.0, 0.5, default=0.1)"
}

gbm_options = {
    'num_boost_round': '100',
    'num_leaves': "ag.space.Int(lower=26, upper=66, default=36)"
}

model_hps = {'NN': nn_options, 'GBM': gbm_options} 

hyperparameters = {
    'label': 'y',
    'time_limits': 2*60,
    'hyperparameters': model_hps,
    'auto_stack': False,    
    'hyperparameter_tune': True,
    'search_strategy': 'skopt'
}
```
**Note:** Your hyperparameter choices may affect the size of the model package, which could result in additional time taken to upload your model and complete training.
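
The search spaces above are written as strings, presumably because SageMaker hands hyperparameters to the training container as strings. As a rough sketch of how a training script could decode them (the helper `decode_hyperparameter` is hypothetical and is not necessarily what this notebook's container does), plain literals can be parsed with `ast.literal_eval`, while expressions such as `ag.space.Int(...)` are left as strings for the script to interpret against AutoGluon:

```python
import ast


def decode_hyperparameter(raw: str):
    """Best-effort decoding of a string-valued hyperparameter (hypothetical helper)."""
    try:
        # Handles numbers, booleans, lists, and nested dicts written as Python literals.
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a literal (e.g. a label name or an "ag.space..." expression): keep the string.
        return raw


# Example of what a container might receive; the values are illustrative.
received = {
    "label": "y",
    "time_limits": "120",
    "auto_stack": "False",
    "hyperparameters": "{'GBM': {'num_boost_round': '100', "
                       "'num_leaves': 'ag.space.Int(lower=26, upper=66, default=36)'}}",
}
decoded = {name: decode_hyperparameter(value) for name, value in received.items()}
print(decoded)
```

Using `ast.literal_eval` rather than `eval` for the first pass keeps arbitrary code out of the parsing step; only the search-space strings would need a controlled evaluation later.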

<br>

For local training, set `train_instance_type` to `local`.  
For non-local training, the recommended instance type is `ml.m5.2xlarge`.

In [None]:
%%time

instance_type = 'ml.m5.2xlarge'
#instance_type = 'local'

ecr_image = f'{ecr_uri_prefix}/{training_algorithm_name}:latest'

estimator = Estimator(image_name=ecr_image,
                      role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      hyperparameters=hyperparameters)

estimator.fit(train_s3_path)

### Create Model

In [None]:
# Create predictor object
class AutoGluonTabularPredictor(RealTimePredictor):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, content_type='text/csv', 
                         serializer=csv_serializer, 
                         deserializer=StringDeserializer(), **kwargs)

In [None]:
ecr_image = f'{ecr_uri_prefix}/{inference_algorithm_name}:latest'

if instance_type == 'local':
    model = estimator.create_model(image=ecr_image, role=role)
else:
    # The estimator exposes the S3 location of the trained model artifact after fit()
    model_uri = estimator.model_data
    model = Model(model_uri, ecr_image, role=role, sagemaker_session=session, predictor_cls=AutoGluonTabularPredictor)
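
With the default output settings, SageMaker stores the training artifact at `<output_path>/<job name>/output/model.tar.gz`, which is the location the non-local branch above passes to `Model`. A minimal sketch of that layout, using plain string handling so the separators stay `/` on any operating system (the helper name, bucket, and job name are made up for illustration):

```python
def model_artifact_uri(output_path: str, job_name: str) -> str:
    """Build the default S3 location of a training job's model archive (illustrative)."""
    # Strip any trailing slash so the joined URI has no empty path segment.
    return "/".join([output_path.rstrip("/"), job_name, "output", "model.tar.gz"])


# Placeholder bucket and job name, for illustration only.
example_model_uri = model_artifact_uri("s3://example-bucket/", "example-training-job")
print(example_model_uri)
```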

### Batch Transform

For local mode, either `s3://<bucket>/<prefix>/output/` or `file:///<absolute_local_path>` can be used as outputs.

By including the label column in the test data, you can also evaluate prediction performance (in this case, pass `test_s3_path` instead of `X_test_s3_path`).

In [None]:
output_path = f's3://{bucket}/{prefix}/output/'
# output_path = f'file://{os.getcwd()}'

transformer = model.transformer(instance_count=1, 
                                instance_type=instance_type,
                                strategy='SingleRecord',
                                max_payload=100,
                                max_concurrent_transforms=1,                              
                                output_path=output_path)

transformer.transform(test_s3_path, content_type='text/csv')
transformer.wait()
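
Batch transform writes one result object per input file under `output_path`, named after the input file with an `.out` suffix. A small sketch for working out where the predictions for `test.csv` should land (the helper name and bucket are hypothetical):

```python
import posixpath


def transform_output_uri(output_path: str, input_uri: str) -> str:
    """Return the expected URI of a batch transform result (illustrative helper)."""
    input_name = posixpath.basename(input_uri)  # e.g. "test.csv"
    return f"{output_path.rstrip('/')}/{input_name}.out"


# Placeholder bucket, for illustration only.
expected_output = transform_output_uri(
    "s3://example-bucket/sagemaker/autogluon-tabular/output/",
    "s3://example-bucket/sagemaker/autogluon-tabular/data/test.csv",
)
print(expected_output)
```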

### Endpoint

##### Deploy remote or local endpoint

In [None]:
instance_type = 'ml.m5.2xlarge'
#instance_type = 'local'

predictor = model.deploy(initial_instance_count=1, 
                         instance_type=instance_type)

##### Attach to endpoint (or reattach if kernel was restarted)

In [None]:
# Select standard or local session based on instance_type
if instance_type == 'local': 
    sess = local_session
else: 
    sess = session

# Attach to endpoint
predictor = AutoGluonTabularPredictor(predictor.endpoint, sagemaker_session=sess)

##### Predict on unlabeled test data

In [None]:
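# Omit the DataFrame index so the payload has the same columns as the uploaded CSV files
results = predictor.predict(X_test.to_csv(index=False))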

# Check output
print(Counter(results.splitlines()))
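
The cell above treats the endpoint response as newline-separated text, so `Counter` over `splitlines()` gives a quick tally of predicted classes. With a made-up response string for illustration:

```python
from collections import Counter

# Hypothetical endpoint response: one predicted label per line.
sample_response = "no\nno\nyes\nno\n"

tally = Counter(sample_response.splitlines())
print(tally)  # Counter({'no': 3, 'yes': 1})
```

A tally that contains only one class, or unexpected values such as column names, is usually a sign that the payload format does not match what the container expects.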

##### Predict on data that includes label column  
Prediction performance metrics will be printed to endpoint logs.

In [None]:
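# Omit the DataFrame index so the payload has the same columns as the uploaded CSV files
results = predictor.predict(test.to_csv(index=False))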

# Check output
sleep(0.1)
print(Counter(results.splitlines()))

##### Check that performance metrics match evaluation printed to endpoint logs as expected

In [None]:
import numpy as np
y_results = np.array(results.splitlines())

print("accuracy: {}".format(accuracy_score(y_true=y_test, y_pred=y_results)))
print(classification_report(y_true=y_test, y_pred=y_results, digits=6))
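
To make the reported numbers concrete, here is a small pure-Python sketch of what accuracy and per-class precision and recall mean, using invented labels rather than real endpoint output. It mirrors the definitions behind `accuracy_score` and `classification_report` but is not a substitute for them:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, treating `positive` as the target label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented labels, for illustration only.
toy_true = ["no", "no", "yes", "yes", "no"]
toy_pred = ["no", "yes", "yes", "no", "no"]

print(accuracy(toy_true, toy_pred))
print(precision_recall(toy_true, toy_pred, positive="yes"))
```

On an imbalanced dataset, accuracy alone can look strong while recall on the minority class stays low, which is why the per-class report is worth reading alongside it.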

##### Clean up endpoint

In [None]:
predictor.delete_endpoint()