# Welcome to the AutoGluon-Tabular Starter Notebook!

This notebook will show you how to get started using AutoGluon in less than 5 minutes, covering local training and deployment to a SageMaker cloud endpoint.

This tutorial uses `autogluon==0.3.1`.

## NOTE: Please use the `conda_python3` kernel.

## Resources:  
Documentation: https://auto.gluon.ai/stable/index.html  
Tutorials: https://auto.gluon.ai/stable/tutorials/tabular_prediction/index.html  
GitHub: https://github.com/awslabs/autogluon/  

![AutoGluon-1](notebook_images/AutoGluon-1.png)

![AutoGluon-2](notebook_images/AutoGluon-2.png)

![AutoGluon-3](notebook_images/AutoGluon-3.png)

![AutoGluon-4](notebook_images/AutoGluon-4.png)

# Local Training with AutoGluon

To begin, we first have to install AutoGluon. We can do so with pip. This may take a few minutes:

In [1]:
# Installing the CPU-only build of torch speeds up installation because the GPU build is much larger
!pip install -q torch==1.10.1+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html

!pip install -q autogluon==0.3.1

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aiobotocore 1.3.0 requires botocore<1.20.50,>=1.20.49, but you have botocore 1.23.43 which is incompatible.


In [2]:
from autogluon.tabular import TabularDataset, TabularPredictor

## Load the data

In [3]:
# Load the adult income dataset
train_data = TabularDataset('data/train.csv')
test_data = TabularDataset('data/test.csv')

#train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
#test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
label = 'class'

In [4]:
train_data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 178478 | Bachelors | 13 | Never-married | Tech-support | Own-child | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 1 | 23 | State-gov | 61743 | 5th-6th | 3 | Never-married | Transport-moving | Not-in-family | White | Male | 0 | 0 | 35 | United-States | <=50K |
| 2 | 46 | Private | 376789 | HS-grad | 9 | Never-married | Other-service | Not-in-family | White | Male | 0 | 0 | 15 | United-States | <=50K |
| 3 | 55 | ? | 200235 | HS-grad | 9 | Married-civ-spouse | ? | Husband | White | Male | 0 | 0 | 50 | United-States | >50K |
| 4 | 36 | Private | 224541 | 7th-8th | 4 | Married-civ-spouse | Handlers-cleaners | Husband | White | Male | 0 | 0 | 40 | El-Salvador | <=50K |


## Train on the data with TabularPredictor

Now we can fit on the data! Notice that we did not have to do any preprocessing ourselves. AutoGluon handles data cleaning automatically, including string values, categorical features, dates, and missing values.

For this demo, we use a fairly basic training configuration. For the best predictive accuracy, pass `presets='best_quality'` to `.fit()`, at the cost of a longer training time.
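As a rough sketch of what a higher-quality configuration might look like, the snippet below collects `fit` arguments in a dictionary and wraps training in a small helper. `presets` and `time_limit` are `fit` parameters in AutoGluon 0.3.1; the 600-second budget and the helper name `train_best_quality` are illustrative assumptions of ours, not settings recommended by this tutorial.

```python
# Illustrative fit settings; the time budget is an arbitrary example value.
best_quality_kwargs = {
    "presets": "best_quality",  # favors accuracy over training speed
    "time_limit": 600,          # approximate training budget for fit(), in seconds
}


def train_best_quality(train_data, label, **overrides):
    """Hypothetical helper: fit a TabularPredictor with the settings above."""
    # Imported lazily so the settings can be inspected without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    fit_kwargs = {**best_quality_kwargs, **overrides}
    return TabularPredictor(label=label).fit(train_data, **fit_kwargs)
```

Calling `train_best_quality(train_data, label, time_limit=1800)` would override the budget while keeping the preset.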

In [5]:
%%time
# Note: We exclude the neural network models for a faster demo
predictor = TabularPredictor(
    label=label,
).fit(
    train_data,
    excluded_model_types=['NN', 'FASTAI'],
)

No path specified. Models will be saved in: "AutogluonModels/ag-20220204_111542/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20220204_111542/"
AutoGluon Version:  0.3.1
Train Data Rows:    39073
Train Data Columns: 14
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' <=50K', ' >50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
	To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
F

CPU times: user 1min 38s, sys: 4.38 s, total: 1min 43s
Wall time: 1min 2s


In [6]:
# Get model leaderboard
predictor.leaderboard(test_data, silent=True)

| | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | XGBoost | 0.875627 | 0.884 | 0.097504 | 0.026699 | 1.726374 | 0.097504 | 0.026699 | 1.726374 | 1 | True | 10 |
| 1 | WeightedEnsemble_L2 | 0.87532 | 0.8888 | 0.450275 | 0.26901 | 7.225114 | 0.005166 | 0.005053 | 1.282624 | 2 | True | 12 |
| 2 | CatBoost | 0.874501 | 0.882 | 0.031218 | 0.01685 | 14.819867 | 0.031218 | 0.01685 | 14.819867 | 1 | True | 7 |
| 3 | LightGBM | 0.873375 | 0.88 | 0.068104 | 0.02892 | 0.831295 | 0.068104 | 0.02892 | 0.831295 | 1 | True | 4 |
| 4 | LightGBMLarge | 0.87143 | 0.8784 | 0.050828 | 0.025063 | 1.025842 | 0.050828 | 0.025063 | 1.025842 | 1 | True | 11 |
| 5 | LightGBMXT | 0.87143 | 0.8792 | 0.241088 | 0.080797 | 3.991288 | 0.241088 | 0.080797 | 3.991288 | 1 | True | 3 |
| 6 | RandomForestGini | 0.858532 | 0.864 | 1.714712 | 0.21107 | 8.341174 | 1.714712 | 0.21107 | 8.341174 | 1 | True | 5 |
| 7 | RandomForestEntr | 0.858225 | 0.8608 | 1.211617 | 0.210793 | 10.558017 | 1.211617 | 0.210793 | 10.558017 | 1 | True | 6 |
| 8 | ExtraTreesGini | 0.851264 | 0.8496 | 0.826217 | 0.310919 | 5.042042 | 0.826217 | 0.310919 | 5.042042 | 1 | True | 8 |
| 9 | ExtraTreesEntr | 0.851059 | 0.8496 | 0.816798 | 0.311288 | 5.445294 | 0.816798 | 0.311288 | 5.445294 | 1 | True | 9 |
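The leaderboard is worth reading along more than one column. The rows above are ordered by `score_test`, while, to the best of our understanding, AutoGluon picks the model it predicts with by validation score. The sketch below re-reads the first four rows with the standard library only (a few columns copied by hand, the rest omitted) to show how the ranking changes with the criterion.

```python
import csv
import io

# First rows of the leaderboard above, copied as CSV (columns trimmed for brevity).
leaderboard_csv = """model,score_test,score_val,pred_time_test,fit_time
XGBoost,0.875627,0.884,0.097504,1.726374
WeightedEnsemble_L2,0.87532,0.8888,0.450275,7.225114
CatBoost,0.874501,0.882,0.031218,14.819867
LightGBM,0.873375,0.88,0.068104,0.831295
"""

rows = list(csv.DictReader(io.StringIO(leaderboard_csv)))

# Rank the same rows by three different criteria.
best_on_test = max(rows, key=lambda r: float(r["score_test"]))
best_on_val = max(rows, key=lambda r: float(r["score_val"]))
fastest = min(rows, key=lambda r: float(r["pred_time_test"]))

print("Best test score:      ", best_on_test["model"])
print("Best validation score:", best_on_val["model"])
print("Fastest at inference: ", fastest["model"])
```

Among these rows, the three criteria each select a different model, which is a useful reminder that the "best" model depends on whether accuracy or latency matters more.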


## Local Prediction

In [7]:
# Get class predictions
predictor.predict(test_data)

0        <=50K
1        <=50K
2         >50K
3        <=50K
4        <=50K
         ...  
9764     <=50K
9765     <=50K
9766     <=50K
9767     <=50K
9768     <=50K
Name: class, Length: 9769, dtype: object

In [8]:
# Get prediction probabilities
predictor.predict_proba(test_data)

| | <=50K | >50K |
|---|---|---|
| 0 | 0.911898 | 0.088102 |
| 1 | 0.998610 | 0.001390 |
| 2 | 0.080540 | 0.919460 |
| 3 | 0.998625 | 0.001375 |
| 4 | 0.979027 | 0.020973 |
| ... | ... | ... |
| 9764 | 0.980276 | 0.019724 |
| 9765 | 0.930274 | 0.069726 |
| 9766 | 0.849258 | 0.150742 |
| 9767 | 0.999579 | 0.000421 |
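The class predictions and the probabilities are two views of the same output: in the rows shown, the predicted class is simply the column with the larger probability. The sketch below reproduces that relationship in plain Python for the first five rows, and also shows how a custom decision threshold could be applied. Treating `>50K` as the positive class follows the training log; the threshold value of 0.05 is only an illustrative assumption.

```python
class_labels = ["<=50K", ">50K"]

# First five rows of the probability table above.
probabilities = [
    [0.911898, 0.088102],
    [0.998610, 0.001390],
    [0.080540, 0.919460],
    [0.998625, 0.001375],
    [0.979027, 0.020973],
]


def to_labels(proba_rows, labels):
    """Pick the label whose probability is largest in each row."""
    return [labels[max(range(len(row)), key=row.__getitem__)] for row in proba_rows]


def to_labels_with_threshold(proba_rows, labels, threshold):
    """Predict the positive class (labels[1]) whenever its probability reaches the threshold."""
    return [labels[1] if row[1] >= threshold else labels[0] for row in proba_rows]


predicted = to_labels(probabilities, class_labels)
sensitive = to_labels_with_threshold(probabilities, class_labels, threshold=0.05)
print(predicted)
print(sensitive)
```

Lowering the threshold trades precision for recall on the positive class, which can be useful when missing a `>50K` case is more costly than a false alarm.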


## Feature Importance

In [9]:
# Get feature importance
predictor.feature_importance(test_data, subsample_size=None)

Computing feature importance via permutation shuffling for 14 features using 9769 rows with 3 shuffle sets...
	21.58s	= Expected runtime (7.19s per shuffle set)
	17.15s	= Actual runtime (Completed 3 of 3 shuffle sets)


| | importance | stddev | p_value | n | p99_high | p99_low |
|---|---|---|---|---|---|---|
| capital-gain | 0.046883 | 0.000614 | 2.9e-05 | 3 | 0.050402 | 0.043364 |
| marital-status | 0.03391681 | 0.002255 | 0.000735 | 3 | 0.046839 | 0.020995 |
| occupation | 0.0180503 | 0.002824 | 0.004032 | 3 | 0.034235 | 0.001866 |
| age | 0.01590064 | 0.003532 | 0.008025 | 3 | 0.036138 | -0.004336 |
| capital-loss | 0.01368274 | 0.001359 | 0.001637 | 3 | 0.021472 | 0.005894 |
| education-num | 0.01262497 | 0.000795 | 0.00066 | 3 | 0.017181 | 0.008069 |
| hours-per-week | 0.008291534 | 0.00057 | 0.000786 | 3 | 0.011557 | 0.005026 |
| relationship | 0.007984441 | 0.001154 | 0.003443 | 3 | 0.014595 | 0.001374 |
| education | 0.005391203 | 0.001434 | 0.011395 | 3 | 0.01361 | -0.002828 |
| workclass | 0.003275668 | 0.001279 | 0.023607 | 3 | 0.010602 | -0.00405 |
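The log above reports that importance is computed by permutation shuffling over 3 shuffle sets. The idea can be sketched in pure Python: shuffle one feature column, re-score the model, and record how much the score drops. This is a simplified illustration on a toy model and synthetic data, not AutoGluon's implementation; the function names `accuracy` and `permutation_importance` are our own.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows where the model's prediction equals the target."""
    return sum(model(row) == target for row, target in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_shuffles=3, seed=0):
    """Average drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_shuffles):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_shuffles)
    return importances


# Toy data: feature 0 determines the label, feature 1 is pure noise.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(500)]
y = [int(row[0] > 0.5) for row in X]


def model(row):
    """Toy model that only looks at feature 0."""
    return int(row[0] > 0.5)


importances = permutation_importance(model, X, y)
print(importances)
```

Shuffling the informative feature destroys most of the accuracy, while shuffling the noise feature changes nothing, which mirrors how `capital-gain` ranks far above `workclass` in the table above.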


# Cloud Endpoint Deployment with AWS SageMaker

## Upload the local predictor to SageMaker

We will package the model directory into a gzipped tar archive so it can be uploaded to S3.  
Before archiving, we reduce the size of the predictor by deleting every model that the best model does not depend on:

In [10]:
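# Keep only the best model and the models it depends on
predictor.delete_models(models_to_keep='best', dry_run=False)
# Remove auxiliary files that are not needed for prediction
predictor.save_space()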

Deleting model KNeighborsDist. All files under AutogluonModels/ag-20220204_111542/models/KNeighborsDist/ will be removed.
Deleting model LightGBM. All files under AutogluonModels/ag-20220204_111542/models/LightGBM/ will be removed.
Deleting model RandomForestGini. All files under AutogluonModels/ag-20220204_111542/models/RandomForestGini/ will be removed.
Deleting model RandomForestEntr. All files under AutogluonModels/ag-20220204_111542/models/RandomForestEntr/ will be removed.
Deleting model CatBoost. All files under AutogluonModels/ag-20220204_111542/models/CatBoost/ will be removed.
Deleting model ExtraTreesGini. All files under AutogluonModels/ag-20220204_111542/models/ExtraTreesGini/ will be removed.
Deleting model ExtraTreesEntr. All files under AutogluonModels/ag-20220204_111542/models/ExtraTreesEntr/ will be removed.
Deleting model LightGBMLarge. All files under AutogluonModels/ag-20220204_111542/models/LightGBMLarge/ will be removed.


In [11]:
import tarfile
import os.path

def make_tarfile(output_filename, source_dir):
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))

In [12]:
make_tarfile('model.tar.gz', predictor.path)
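One detail of `make_tarfile` is easy to miss: `arcname=os.path.basename(source_dir)` behaves differently depending on whether the path ends with a slash. The predictor path printed in the training log does end with one, so `basename` returns an empty string. The sketch below runs the same function on a throwaway directory to show both layouts; the directory name `ag-model` and the file `predictor.pkl` are placeholders, not a claim about the real contents of a predictor folder.

```python
import os
import tarfile
import tempfile


def make_tarfile(output_filename, source_dir):
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a predictor directory; the file name is only a placeholder.
    model_dir = os.path.join(tmp, "ag-model")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "predictor.pkl"), "w") as f:
        f.write("placeholder")

    # Path without a trailing slash: members are nested under the folder name.
    nested = os.path.join(tmp, "nested.tar.gz")
    make_tarfile(nested, model_dir)
    with tarfile.open(nested) as tar:
        nested_names = tar.getnames()

    # Path with a trailing slash: basename is "", so files sit at the archive root.
    flat = os.path.join(tmp, "flat.tar.gz")
    make_tarfile(flat, model_dir + os.sep)
    with tarfile.open(flat) as tar:
        flat_names = tar.getnames()

print(nested_names)
print(flat_names)
```

If an endpoint fails to find the model files, listing the archive with `tar.getnames()` like this is a quick way to check which layout was actually uploaded.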

Next we will import several helper classes and functions that integrate the APIs of AutoGluon with SageMaker.  
The source code is maintained by AWS and is [available on GitHub](https://github.com/aws/amazon-sagemaker-examples/tree/master/advanced_functionality/autogluon-tabular-containers).

In [13]:
import sagemaker
import pandas as pd
import numpy as np
from sagemaker.serializers import CSVSerializer
import os

from ag_model import (
    AutoGluonTraining,
    AutoGluonInferenceModel,
    AutoGluonTabularPredictor,
)

Create a SageMaker session:

In [14]:
sagemaker_session = sagemaker.session.Session()

Upload the archived AutoGluon model to S3:

In [15]:
endpoint_name = sagemaker.utils.unique_name_from_base("sagemaker-autogluon-serving-trained-model")
model_data = sagemaker_session.upload_data(path='model.tar.gz', key_prefix=f'{endpoint_name}/models')
model_data

's3://sagemaker-us-west-2-328296961357/sagemaker-autogluon-serving-trained-model-1643973435-c53e/models/model.tar.gz'
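`upload_data` returns the S3 URI of the uploaded archive. When the object needs to be inspected with other tools, it helps to split that URI into its bucket and key. The helper below is a small standard-library sketch; `split_s3_uri` is our own name, not part of the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# The URI printed by the upload cell above.
bucket, key = split_s3_uri(
    "s3://sagemaker-us-west-2-328296961357/"
    "sagemaker-autogluon-serving-trained-model-1643973435-c53e/models/model.tar.gz"
)
print(bucket)
print(key)
```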

## Create a SageMaker Endpoint

Here we specify the AWS region, the instance type that will host the endpoint, and the AutoGluon version the endpoint will run.  
It is important that the endpoint runs the same AutoGluon version that was used during `fit`.

In [16]:
import autogluon.core

region = sagemaker_session.boto_region_name  # public equivalent of the private _region_name attribute
instance_type = "ml.m5.2xlarge"
framework_version = autogluon.core.__version__
print(f'Region: {region}, Instance Type: {instance_type}, Framework Version: {framework_version}')

Region: us-west-2, Instance Type: ml.m5.2xlarge, Framework Version: 0.3.1


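The cell below defines an optional VPC configuration. It is only needed when the endpoint must be deployed with KMS and VPC support, and can be skipped otherwise.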

Refer to https://github.com/gradientsky/ag-vpc-setup/blob/master/lib/ag-vpc-stack.ts#L88-L90 for more details.

In [17]:
# Specify vpc_config (optional)
vpc_config = {
    # security groups need to be configured to communicate
    # with each other for distributed training job
    "SecurityGroupIds": ["sg-0cea39349566bacc6"],
    "Subnets": ["subnet-078524620ffc22d15", "subnet-08b67c5fac3f0b048"]
}
vpc_config

{'SecurityGroupIds': ['sg-0cea39349566bacc6'],
 'Subnets': ['subnet-078524620ffc22d15', 'subnet-08b67c5fac3f0b048']}

Create the SageMaker model that the endpoint will serve. It wraps the predictor we trained locally, so its predictions should match the local ones.

In [18]:
model = AutoGluonInferenceModel(
    model_data=model_data,
    role=sagemaker.get_execution_role(),
    region=region,
    framework_version=framework_version,
    instance_type=instance_type,
    source_dir="scripts",
    entry_point="tabular_serve.py",
    # vpc_config=vpc_config,  # uncomment to deploy inside the VPC configured above
)

Deploy the model to a cloud endpoint. The instance takes a few minutes to initialize.

In [19]:
predictor_endpoint = model.deploy(
    initial_instance_count=1,
    serializer=CSVSerializer(),
    instance_type=instance_type,
    # kms_key=kms_key,  # uncomment for KMS encryption; kms_key must be defined first
)

Creating model with name: autogluon-inference-2022-02-04-11-17-19-961
Creating endpoint-config with name autogluon-inference-2022-02-04-11-17-20-250
Creating endpoint with name autogluon-inference-2022-02-04-11-17-20-250


------!

Now we have a cloud endpoint deployed and ready to predict on data! 

In [20]:
test_data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | Private | 169085 | 11th | 7 | Married-civ-spouse | Sales | Wife | White | Female | 0 | 0 | 20 | United-States | <=50K |
| 1 | 17 | Self-emp-not-inc | 226203 | 12th | 8 | Never-married | Sales | Own-child | White | Male | 0 | 0 | 45 | United-States | <=50K |
| 2 | 47 | Private | 54260 | Assoc-voc | 11 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 1887 | 60 | United-States | >50K |
| 3 | 21 | Private | 176262 | Some-college | 10 | Never-married | Exec-managerial | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |
| 4 | 17 | Private | 241185 | 12th | 8 | Never-married | Prof-specialty | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |


## Cloud Prediction

The SageMaker endpoint expects a slightly different input format than the local predictor.  
We first drop the label column and convert the remaining features to a NumPy array before calling `predict`.

In [21]:
test_data_endpoint = test_data.drop(columns=label).values
y_pred_endpoint = predictor_endpoint.predict(test_data_endpoint)
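Because the array is sent through a CSV serializer, the endpoint receives rows of values without column names, so the column order has to match the order used during training. The sketch below imitates that flattening with the standard library to make the point concrete. It is an approximation for illustration, not the code of SageMaker's `CSVSerializer`, and it uses only four of the feature columns from the first two test rows.

```python
import csv
import io

# A subset of the feature columns, in training order (subset chosen for brevity).
feature_columns = ["age", "workclass", "education", "hours-per-week"]

# First two test rows from the table above, still carrying the label.
records = [
    {"age": 31, "workclass": "Private", "education": "11th",
     "hours-per-week": 20, "class": "<=50K"},
    {"age": 17, "workclass": "Self-emp-not-inc", "education": "12th",
     "hours-per-week": 45, "class": "<=50K"},
]


def to_csv_payload(rows, columns):
    """Write the selected columns as header-less CSV, one line per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[col] for col in columns])
    return buffer.getvalue()


payload = to_csv_payload(records, feature_columns)
print(payload)
```

The label is left out simply by not listing it in `feature_columns`, which plays the same role as `drop(columns=label)` in the cell above.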

In [22]:
y_pred_endpoint[:10]

[[' <=50K', 0.9118983745574951, 0.08810162544250488],
 [' <=50K', 0.9986099004745483, 0.0013900835765525699],
 [' >50K', 0.08054006099700928, 0.9194599390029907],
 [' <=50K', 0.9986252784729004, 0.0013747334014624357],
 [' <=50K', 0.9790266156196594, 0.020973367616534233],
 [' >50K', 0.20037841796875, 0.79962158203125],
 [' >50K', 0.0017440319061279297, 0.9982559680938721],
 [' <=50K', 0.6143794059753418, 0.3856205940246582],
 [' <=50K', 0.9954660534858704, 0.004533917643129826],
 [' <=50K', 0.8534858226776123, 0.14651420712471008]]

The predictions returned by the endpoint contain both predictions and prediction probabilities.  
We can convert this result into the same format that the local predictor would return:

In [23]:
y_pred = pd.Series([p[0] for p in y_pred_endpoint], index=test_data.index, name=label)
y_pred_proba = pd.DataFrame([[p[1], p[2]] for p in y_pred_endpoint], index=test_data.index, columns=predictor.class_labels)

In [24]:
y_pred

0        <=50K
1        <=50K
2         >50K
3        <=50K
4        <=50K
         ...  
9764     <=50K
9765     <=50K
9766     <=50K
9767     <=50K
9768     <=50K
Name: class, Length: 9769, dtype: object

In [25]:
y_pred_proba

| | <=50K | >50K |
|---|---|---|
| 0 | 0.911898 | 0.088102 |
| 1 | 0.998610 | 0.001390 |
| 2 | 0.080540 | 0.919460 |
| 3 | 0.998625 | 0.001375 |
| 4 | 0.979027 | 0.020973 |
| ... | ... | ... |
| 9764 | 0.980276 | 0.019724 |
| 9765 | 0.930274 | 0.069726 |
| 9766 | 0.849258 | 0.150742 |
| 9767 | 0.999579 | 0.000421 |
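A quick sanity check after deployment is to confirm that the endpoint returns the same probabilities as the local predictor. The sketch below compares the first five rows printed earlier in this notebook. The tolerance of `1e-6` is our assumption, chosen because the local values are displayed rounded to six decimals.

```python
import math

# First five rows of the local predict_proba output (rounded as displayed).
local_proba = [
    [0.911898, 0.088102],
    [0.998610, 0.001390],
    [0.080540, 0.919460],
    [0.998625, 0.001375],
    [0.979027, 0.020973],
]

# First five rows returned by the endpoint: [label, proba_0, proba_1].
endpoint_rows = [
    [" <=50K", 0.9118983745574951, 0.08810162544250488],
    [" <=50K", 0.9986099004745483, 0.0013900835765525699],
    [" >50K", 0.08054006099700928, 0.9194599390029907],
    [" <=50K", 0.9986252784729004, 0.0013747334014624357],
    [" <=50K", 0.9790266156196594, 0.020973367616534233],
]


def probabilities_match(local, remote, tol=1e-6):
    """True if every endpoint probability is within tol of the local one."""
    return all(
        math.isclose(local_value, remote_value, abs_tol=tol)
        for local_row, remote_row in zip(local, remote)
        for local_value, remote_value in zip(local_row, remote_row[1:])
    )


print(probabilities_match(local_proba, endpoint_rows))
```

Note that the labels returned by the endpoint carry a leading space (`' <=50K'`), as in the training log, so string comparisons against cleaned labels may need a `strip()`.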


# Cloud Training & Batch Inference

For cloud training and batch inference, please refer to our comprehensive [AutoGluon Cloud Tutorial](https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/autogluon-tabular-containers/AutoGluon_Tabular_SageMaker_Containers.ipynb).

# Cleanup

Delete the endpoint (shutting down the instance) when you are done to avoid being charged for the idle instance.

In [26]:
predictor_endpoint.delete_endpoint()
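Since an idle endpoint keeps incurring charges, it can be worth making the cleanup automatic, so the endpoint is deleted even if a prediction call raises. One possible pattern is a small context manager, sketched below with a stand-in object so that it runs without AWS access. `temporary_endpoint` and `FakeEndpoint` are hypothetical names of ours; the only assumption about the real object is the `delete_endpoint()` method used above.

```python
from contextlib import contextmanager


@contextmanager
def temporary_endpoint(endpoint):
    """Yield the endpoint and always delete it afterwards, even on errors."""
    try:
        yield endpoint
    finally:
        endpoint.delete_endpoint()


class FakeEndpoint:
    """Stand-in for a deployed predictor, used only to demonstrate the pattern."""

    def __init__(self):
        self.deleted = False

    def predict(self, data):
        raise RuntimeError("simulated prediction failure")

    def delete_endpoint(self):
        self.deleted = True


fake = FakeEndpoint()
try:
    with temporary_endpoint(fake) as endpoint:
        endpoint.predict([[1, 2, 3]])
except RuntimeError:
    pass  # the failure is expected in this demonstration

print(fake.deleted)
```

With a real deployment, `predictor_endpoint` would take the place of `fake`, and the explicit `delete_endpoint()` call would no longer be needed.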

# Next Steps

Like what you saw? Check out the [AutoGluon Website](https://auto.gluon.ai/stable/index.html) for more tutorials!