[![brightonr-banner-logo](images/notebanner.png)](https://www.meetup.com/en-AU/Silicon-Brighton-Brighton-R/)

<div style = "text-align: right"><font size = 5 color = "#0077be" face = "verdana"><b>AutoGluon - Get on the AutoML-wagon</b></font></div>
<div style = "text-align: right"><font><i>By 'Dayo Oguntoyinbo</i></font></div>
<div style = "text-align: right"><font>28th April 2022</font></div>

We will follow the Quick Start from the official AutoGluon documentation: https://auto.gluon.ai/stable/index.html

In [1]:
import os
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_auc_score

import warnings
warnings.simplefilter(action='ignore', category=RuntimeWarning) 
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning) 

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.options.display.max_colwidth = None
pd.set_option("display.float_format", lambda x: '%.2f' % x)
# pandas >= 1.5 exposes this warning in pandas.errors (older releases: pandas.core.common)
from pandas.errors import SettingWithCopyWarning
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# AutoGluon Tabular Prediction

https://auto.gluon.ai/stable/tutorials/tabular_prediction/tabular-quickstart.html

In [2]:
# !pip install -Uq pip
# !pip install -Uq setuptools wheel

In [3]:
# # CPU version of pytorch has smaller footprint - see installation instructions in
# # pytorch documentation - https://pytorch.org/get-started/locally/
# !pip install -Uq torch==1.10.1+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html
# !pip install -Uq autogluon

In [4]:
from autogluon.tabular import TabularDataset, TabularPredictor

In [5]:
train_data = TabularDataset(data='inputs/train.csv')

In [6]:
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')

In [7]:
print(f"Size of Train data is: {train_data.shape}")
print(f"Size of Test data is: {test_data.shape}")

Size of Train data is: (39073, 15)
Size of Test data is: (9769, 15)


In [8]:
# # subsample subset of data for faster demo, try setting this to much larger values
# subsample_size = 500

# train_data = train_data.sample(n=subsample_size, random_state=42)

In [9]:
print(f"Size of Train data is: {train_data.shape}")

Size of Train data is: (39073, 15)


In [10]:
train_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,178478,Bachelors,13,Never-married,Tech-support,Own-child,White,Female,0,0,40,United-States,<=50K
1,23,State-gov,61743,5th-6th,3,Never-married,Transport-moving,Not-in-family,White,Male,0,0,35,United-States,<=50K
2,46,Private,376789,HS-grad,9,Never-married,Other-service,Not-in-family,White,Male,0,0,15,United-States,<=50K
3,55,?,200235,HS-grad,9,Married-civ-spouse,?,Husband,White,Male,0,0,50,United-States,>50K
4,36,Private,224541,7th-8th,4,Married-civ-spouse,Handlers-cleaners,Husband,White,Male,0,0,40,El-Salvador,<=50K


In [11]:
label = 'class'
print("Summary of class variable: \n", train_data[label].describe())

Summary of class variable: 
 count      39073
unique         2
top        <=50K
freq       29704
Name: class, dtype: object
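
The summary shows that 29,704 of the 39,073 training rows carry the most frequent label, so the classes are imbalanced. As a rough sanity check (not part of the Quick Start), the sketch below computes the accuracy that a model always predicting the majority class would reach on the training data. The accuracies reported later are best read relative to this floor.

```python
# Counts copied from the describe() output above.
n_rows = 39073
n_majority = 29704  # rows labelled <=50K

# A model that always predicts the most frequent class is right this often.
baseline_accuracy = n_majority / n_rows
minority_share = 1 - baseline_accuracy

print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
print(f"Share of >50K rows: {minority_share:.4f}")
```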


Then we use AutoGluon to train multiple models as follows (you can also specify the folder where the trained models are saved):

In [12]:
save_path = 'agModels-predictClass'  # specifies folder to store trained models

## Train Data

In [13]:
%%time

predictor = (
    TabularPredictor(label=label, path=save_path)
    # eval_metric=eval_metric
).fit(train_data) # , time_limit=2*60)


Beginning AutoGluon training ...
AutoGluon will save models to "agModels-predictClass/"
AutoGluon Version:  0.3.1
Train Data Rows:    39073
Train Data Columns: 14
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' <=50K', ' >50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
	To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    64870.44 MB
	Tra

CPU times: user 2min 37s, sys: 9.29 s, total: 2min 46s
Wall time: 25.8 s


In [14]:
predictor.leaderboard(silent=True)

Unnamed: 0,model,score_val,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,CatBoost,0.89,0.01,4.95,0.01,4.95,1,True,7
1,WeightedEnsemble_L2,0.89,0.02,6.01,0.0,1.06,2,True,12
2,XGBoost,0.88,0.02,0.66,0.02,0.66,1,True,10
3,LightGBM,0.88,0.02,0.38,0.02,0.38,1,True,4
4,LightGBMXT,0.88,0.02,1.29,0.02,1.29,1,True,3
5,LightGBMLarge,0.88,0.02,0.65,0.02,0.65,1,True,11
6,RandomForestGini,0.86,0.11,1.44,0.11,1.44,1,True,5
7,RandomForestEntr,0.86,0.11,1.73,0.11,1.73,1,True,6
8,ExtraTreesGini,0.85,0.11,1.23,0.11,1.23,1,True,8
9,ExtraTreesEntr,0.85,0.11,1.33,0.11,1.33,1,True,9


## Test Data

In [15]:
y_test = test_data[label]  # values to predict
test_data_nolab = test_data.drop(columns=[label])  # delete label column to prove we're not cheating

In [16]:
test_data_nolab.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,31,Private,169085,11th,7,Married-civ-spouse,Sales,Wife,White,Female,0,0,20,United-States
1,17,Self-emp-not-inc,226203,12th,8,Never-married,Sales,Own-child,White,Male,0,0,45,United-States
2,47,Private,54260,Assoc-voc,11,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,1887,60,United-States
3,21,Private,176262,Some-college,10,Never-married,Exec-managerial,Own-child,White,Female,0,0,30,United-States
4,17,Private,241185,12th,8,Never-married,Prof-specialty,Own-child,White,Male,0,0,20,United-States


We can make predictions on the new data either with the predictor still in memory or with one loaded from the saved folder, and then evaluate performance.

In [17]:
y_pred_direct = predictor.predict(test_data_nolab)

In [18]:
print("Predictions:  \n", y_pred_direct[0:5])

perf = predictor.evaluate_predictions(
    y_true=y_test,
    y_pred=y_pred_direct,
    auxiliary_metrics=True
)

Evaluation: accuracy on test data: 0.8753198894462074
Evaluations on test data:
{
    "accuracy": 0.8753198894462074,
    "balanced_accuracy": 0.7922548108093962,
    "mcc": 0.635971335494168,
    "f1": 0.7070707070707072,
    "precision": 0.7989130434782609,
    "recall": 0.634167385677308
}


Predictions:  
 0     <=50K
1     <=50K
2      >50K
3     <=50K
4     <=50K
Name: class, dtype: object


In [19]:
loaded_predictor = TabularPredictor.load(save_path)  # unnecessary, just demonstrates how to load previously-trained predictor from file

y_pred = loaded_predictor.predict(test_data_nolab)

print("Predictions:  \n", y_pred[0:5])
perf = loaded_predictor.evaluate_predictions(
    y_true=y_test,
    y_pred=y_pred,
    auxiliary_metrics=True
)

Predictions:  
 0     <=50K
1     <=50K
2      >50K
3     <=50K
4     <=50K
Name: class, dtype: object


Evaluation: accuracy on test data: 0.8753198894462074
Evaluations on test data:
{
    "accuracy": 0.8753198894462074,
    "balanced_accuracy": 0.7922548108093962,
    "mcc": 0.635971335494168,
    "f1": 0.7070707070707072,
    "precision": 0.7989130434782609,
    "recall": 0.634167385677308
}


We can evaluate the performance of each individual trained model on our (labeled) test data.

In [20]:
predictor.leaderboard(test_data, silent=True)

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,XGBoost,0.88,0.88,0.06,0.02,0.66,0.06,0.02,0.66,1,True,10
1,CatBoost,0.88,0.89,0.02,0.01,4.95,0.02,0.01,4.95,1,True,7
2,WeightedEnsemble_L2,0.88,0.89,0.02,0.02,6.01,0.0,0.0,1.06,2,True,12
3,LightGBM,0.87,0.88,0.02,0.02,0.38,0.02,0.02,0.38,1,True,4
4,LightGBMLarge,0.87,0.88,0.02,0.02,0.65,0.02,0.02,0.65,1,True,11
5,LightGBMXT,0.87,0.88,0.04,0.02,1.29,0.04,0.02,1.29,1,True,3
6,RandomForestGini,0.86,0.86,0.36,0.11,1.44,0.36,0.11,1.44,1,True,5
7,RandomForestEntr,0.86,0.86,0.35,0.11,1.73,0.35,0.11,1.73,1,True,6
8,ExtraTreesGini,0.85,0.85,0.56,0.11,1.23,0.56,0.11,1.23,1,True,8
9,ExtraTreesEntr,0.85,0.85,0.49,0.11,1.33,0.49,0.11,1.33,1,True,9


**AutoGluon** automatically and iteratively tests values for hyperparameters to produce the best performance on the validation data. You can also tune hyperparameters yourself (see the examples in the AutoGluon GitHub repository).

This involves repeatedly training models under different hyperparameter settings and evaluating their performance. The process can be computationally intensive, so `fit()` can parallelize it across multiple threads (and machines, if distributed resources are available). To control runtimes, you can specify various arguments in `fit()`, as demonstrated in the subsequent In-Depth tutorial.
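
As a minimal sketch of controlling runtime, the block below collects two `fit()` arguments: `time_limit` (also used later in this notebook) and `hyperparameters`, which restricts training to a few model families. The helper names `build_fit_kwargs` and `fit_with_budget` are hypothetical, and the specific values are illustrative assumptions rather than recommendations; `'GBM'`, `'CAT'` and `'RF'` are AutoGluon's keys for LightGBM, CatBoost and random forest.

```python
def build_fit_kwargs(time_limit_s=120):
    """Return illustrative keyword arguments for TabularPredictor.fit()."""
    return {
        # Stop training after roughly this many seconds of wall-clock time.
        "time_limit": time_limit_s,
        # Train only these model families, each with its default settings:
        # LightGBM ('GBM'), CatBoost ('CAT') and random forest ('RF').
        "hyperparameters": {"GBM": {}, "CAT": {}, "RF": {}},
    }


def fit_with_budget(train_data, label, time_limit_s=120):
    """Fit a predictor under the budget above (requires autogluon installed)."""
    # Imported lazily so build_fit_kwargs can be inspected without AutoGluon.
    from autogluon.tabular import TabularPredictor

    return TabularPredictor(label=label).fit(
        train_data, **build_fit_kwargs(time_limit_s)
    )


fit_kwargs = build_fit_kwargs(time_limit_s=60)
print(fit_kwargs)
```

In this notebook the call would look like `fit_with_budget(train_data, label, time_limit_s=60)`. Fewer model families and a tighter time limit shorten training, usually at some cost in accuracy.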

For tabular problems, `fit()` returns a Predictor object. For classification, you can easily output predicted class probabilities instead of predicted classes:

In [21]:
pred_probs = predictor.predict_proba(test_data_nolab)

In [22]:
pred_probs.head(5)

Unnamed: 0,<=50K,>50K
0,0.91,0.09
1,0.99,0.01
2,0.02,0.98
3,1.0,0.0
4,1.0,0.0
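
For a binary problem, picking the more probable class is the same as thresholding the positive-class probability at 0.5. The sketch below applies that rule to the five rounded probabilities displayed above, using a hypothetical helper `to_labels`; the label strings are written without the leading space that appears in the training log. It also shows how a lower threshold flags more rows as >50K.

```python
def to_labels(positive_probs, threshold=0.5, positive=">50K", negative="<=50K"):
    """Map positive-class probabilities to class labels at a given threshold."""
    return [positive if p >= threshold else negative for p in positive_probs]


# >50K probabilities for the first five test rows, as displayed above.
probs_over_50k = [0.09, 0.01, 0.98, 0.0, 0.0]

default_labels = to_labels(probs_over_50k)
# A lower threshold labels more rows as >50K (higher recall, lower precision).
lenient_labels = to_labels(probs_over_50k, threshold=0.05)

print(default_labels)
print(lenient_labels)
```

With the 0.5 threshold the labels agree with the first five predictions printed earlier.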


Besides inference, this object can also summarize what happened during fit.

In [23]:
results = predictor.fit_summary(show_plot=True)

*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val  pred_time_val  fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0              CatBoost       0.89           0.01      4.95                    0.01               4.95            1       True          7
1   WeightedEnsemble_L2       0.89           0.02      6.01                    0.00               1.06            2       True         12
2               XGBoost       0.88           0.02      0.66                    0.02               0.66            1       True         10
3              LightGBM       0.88           0.02      0.38                    0.02               0.38            1       True          4
4            LightGBMXT       0.88           0.02      1.29                    0.02               1.29            1       True          3
5         LightGBMLarge       0.88           0.02      0.65                    0.02               0.65        

In [24]:
print("AutoGluon infers problem type is: ", predictor.problem_type)
print("AutoGluon identified the following types of features:")
print(predictor.feature_metadata)

AutoGluon infers problem type is:  binary
AutoGluon identified the following types of features:
('category', [])  : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
('int', [])       : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('int', ['bool']) : 1 | ['sex']


## Maximizing predictive performance

In [25]:
time_limit = 60  # for quick demonstration only, you should set this to longest time you are willing to wait (in seconds)

metric = 'roc_auc'  # specify your evaluation metric here
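
`roc_auc` measures how well the predicted scores rank positive rows above negative ones, so it is computed from probabilities rather than from hard class labels. As a minimal sketch on made-up toy data (not the income dataset), AUC can be computed as the fraction of positive/negative pairs that are ordered correctly, with ties counting as half. The function name `roc_auc_pairwise` is hypothetical, and the quadratic loop is for illustration only.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy labels (1 = positive) and scores, invented for this example.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_pairwise(toy_labels, toy_scores)
print(auc)
```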

In [26]:
%%time

predictor_max = (
    TabularPredictor(label, eval_metric=metric)
    .fit(train_data, time_limit=time_limit, presets='best_quality')
)

No path specified. Models will be saved in: "AutogluonModels/ag-20220428_150632/"
Presets specified: ['best_quality']
Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "AutogluonModels/ag-20220428_150632/"
AutoGluon Version:  0.3.1
Train Data Rows:    39073
Train Data Columns: 14
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' <=50K', ' >50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
	To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.


[1000]	train_set's binary_logloss: 0.22337	valid_set's binary_logloss: 0.279745


	0.9244	 = Validation score   (roc_auc)
	15.39s	 = Training   runtime
	0.32s	 = Validation runtime
Fitting model: LightGBM_BAG_L1 ... Training model for up to 22.4s of the 42.31s of remaining time.
	0.9289	 = Validation score   (roc_auc)
	7.15s	 = Training   runtime
	0.19s	 = Validation runtime
Fitting model: RandomForestGini_BAG_L1 ... Training model for up to 14.95s of the 34.86s of remaining time.
	0.9061	 = Validation score   (roc_auc)
	1.53s	 = Training   runtime
	1.35s	 = Validation runtime
Fitting model: RandomForestEntr_BAG_L1 ... Training model for up to 11.51s of the 31.42s of remaining time.
	0.9067	 = Validation score   (roc_auc)
	1.93s	 = Training   runtime
	1.34s	 = Validation runtime
Fitting model: CatBoost_BAG_L1 ... Training model for up to 7.69s of the 27.6s of remaining time.
	Time limit exceeded... Skipping CatBoost_BAG_L1.
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L2 ... Training model for up to 59.72s of the 18.33s of remaining time

CPU times: user 11min 2s, sys: 10.6 s, total: 11min 12s
Wall time: 1min 7s


In [27]:
predictor_max.leaderboard(test_data, silent=True)

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,LightGBMXT_BAG_L2,0.93,0.93,2.2,3.91,33.73,0.21,0.19,7.07,2,True,8
1,LightGBM_BAG_L1,0.93,0.93,0.32,0.19,7.15,0.32,0.19,7.15,1,True,4
2,WeightedEnsemble_L2,0.93,0.93,0.43,0.41,14.14,0.0,0.01,6.66,2,True,7
3,LightGBM_BAG_L2,0.93,0.93,2.16,3.86,30.68,0.17,0.14,4.02,2,True,9
4,WeightedEnsemble_L3,0.93,0.93,2.68,5.3,43.19,0.0,0.01,3.4,3,True,11
5,LightGBMXT_BAG_L1,0.93,0.92,0.74,0.32,15.39,0.74,0.32,15.39,1,True,3
6,RandomForestGini_BAG_L2,0.93,0.93,2.29,4.96,28.69,0.31,1.25,2.03,2,True,10
7,RandomForestEntr_BAG_L1,0.91,0.91,0.36,1.34,1.93,0.36,1.34,1.93,1,True,6
8,RandomForestGini_BAG_L1,0.91,0.91,0.36,1.35,1.53,0.36,1.35,1.53,1,True,5
9,KNeighborsDist_BAG_L1,0.69,0.7,0.11,0.21,0.33,0.11,0.21,0.33,1,True,2


In [28]:
# Use the first predictor (default settings, not predictor_max) to make predictions on the test dataset
test_predictions = predictor.predict(test_data)

In [29]:
print(confusion_matrix(y_test, test_predictions))  # scikit-learn expects (y_true, y_pred)
print(classification_report(y_test, test_predictions))
# print("Accuracy (test):", accuracy_score(y_test, test_predictions))
# print("ROC AUC (test):", roc_auc_score(y_test, pred_probs.iloc[:, 1]))  # needs probabilities, not labels

[[7081  370]
 [ 848 1470]]
              precision    recall  f1-score   support

       <=50K       0.89      0.95      0.92      7451
        >50K       0.80      0.63      0.71      2318

    accuracy                           0.88      9769
   macro avg       0.85      0.79      0.81      9769
weighted avg       0.87      0.88      0.87      9769
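
The counts in the confusion matrix are enough to reproduce the metrics that `evaluate_predictions` reported earlier. Treating >50K as the positive class, the matrix gives 1470 rows that are both predicted and truly >50K, 370 predicted >50K but truly <=50K, 848 predicted <=50K but truly >50K, and 7081 where both are <=50K. The sketch below is a plain-Python cross-check using the standard formulas, not AutoGluon code.

```python
import math

# Counts read from the confusion matrix, with >50K as the positive class.
tp, fp, fn, tn = 1470, 370, 848, 7081

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)            # of rows predicted >50K, how many are
recall = tp / (tp + fn)               # of rows truly >50K, how many are found
specificity = tn / (tn + fp)          # recall of the <=50K class
balanced_accuracy = (recall + specificity) / 2
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [
    ("accuracy", accuracy),
    ("balanced_accuracy", balanced_accuracy),
    ("mcc", mcc),
    ("f1", f1),
    ("precision", precision),
    ("recall", recall),
]:
    print(f"{name}: {value:.4f}")
```

The gap between accuracy and balanced accuracy comes from the class imbalance: the model is much better at recognising the majority <=50K class than the >50K class.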

