`AutoGluon Tutorial`

- Author: Rui Zhu
- Date: 2025-01-11
- Follow: https://auto.gluon.ai/stable/tutorials/tabular/index.html#
- How it works: https://auto.gluon.ai/stable/tutorials/tabular/how-it-works.html

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

---
# Quick Start
- https://auto.gluon.ai/stable/tutorials/tabular/tabular-quick-start.html

In [2]:
from autogluon.tabular import TabularDataset, TabularPredictor

## Load Example Data

In [3]:
data_url = 'https://raw.githubusercontent.com/mli/ag-docs/main/knot_theory/'
train_data = TabularDataset(f'{data_url}train.csv')
test_data = TabularDataset(f'{data_url}test.csv')
train_data.head()

,Unnamed: 0,chern_simons,cusp_volume,hyperbolic_adjoint_torsion_degree,hyperbolic_torsion_degree,injectivity_radius,longitudinal_translation,meridinal_translation_imag,meridinal_translation_real,short_geodesic_imag_part,short_geodesic_real_part,Symmetry_0,Symmetry_D3,Symmetry_D4,Symmetry_D6,Symmetry_D8,Symmetry_Z/2 + Z/2,volume,signature
0,70746,0.09053,12.226322,0,10,0.507756,10.685555,1.144192,-0.519157,-2.760601,1.015512,0.0,0.0,0.0,0.0,0.0,1.0,11.393225,-2
1,240827,0.232453,13.800773,0,14,0.413645,10.453156,1.320249,-0.158522,-3.013258,0.827289,0.0,0.0,0.0,0.0,0.0,1.0,12.742782,0
2,155659,-0.144099,14.76103,0,14,0.436928,13.405199,1.101142,0.768894,2.233106,0.873856,0.0,0.0,0.0,0.0,0.0,0.0,15.236505,2
3,239963,-0.171668,13.738019,0,22,0.249481,27.819496,0.493827,-1.188718,-2.042771,0.498961,0.0,0.0,0.0,0.0,0.0,0.0,17.27989,-8
4,90504,0.235188,15.896359,0,10,0.389329,15.330971,1.036879,0.722828,-3.056138,0.778658,0.0,0.0,0.0,0.0,0.0,0.0,16.749298,4


In [4]:
label = 'signature'
train_data[label].describe()

count    10000.000000
mean        -0.022000
std          3.025166
min        -12.000000
25%         -2.000000
50%          0.000000
75%          2.000000
max         12.000000
Name: signature, dtype: float64

## Training

In [5]:
predictor = TabularPredictor(label=label).fit(train_data)

No path specified. Models will be saved in: "AutogluonModels/ag-20250111_070654"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.2
Python Version:     3.12.7
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 24.2.0: Fri Dec  6 19:01:59 PST 2024; root:xnu-11215.61.5~2/RELEASE_ARM64_T6000
CPU Count:          8
Memory Avail:       4.75 GB / 16.00 GB (29.7%)
Disk Space Avail:   322.92 GB / 926.35 GB (34.9%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and
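
The log above notes that no preset was specified, so AutoGluon fell back to `'medium'`. A minimal sketch of passing a preset and a time budget explicitly is shown below. `fit_with_preset` is a hypothetical helper name, and the default preset string and the 60-second limit are illustrative assumptions, not values used in this notebook.

```python
def fit_with_preset(train_data, label, presets="medium_quality", time_limit=60):
    """Fit a TabularPredictor with an explicit preset and a time budget in seconds."""
    # Imported lazily so that defining this helper does not itself require AutoGluon.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=label)
    # `presets` and `time_limit` are keyword arguments of TabularPredictor.fit.
    return predictor.fit(train_data, presets=presets, time_limit=time_limit)
```

A call might look like `fit_with_preset(train_data, label='signature', presets='best_quality', time_limit=600)`; higher-quality presets generally trade longer training for accuracy.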

## Prediction

In [6]:
print("Making predictions...")
y_pred = predictor.predict(test_data.drop(columns=[label]))
y_pred.head()

Making predictions...


0   -4
1   -2
2    0
3    4
4    2
Name: signature, dtype: int64

## Evaluation

In [7]:
predictor.evaluate(test_data, silent=True)

{'accuracy': 0.9504,
 'balanced_accuracy': 0.7477348536609515,
 'mcc': 0.9392153892336176}
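
The gap between `accuracy` (0.95) and `balanced_accuracy` (0.75) above suggests that some rare signature values are predicted poorly. The sketch below shows how the two metrics differ, assuming the usual definition of balanced accuracy as the mean of per-class recall. The toy labels are made up for illustration and are not taken from the knot dataset.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of rows predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Toy labels: class 12 is rare and is always misclassified as 0.
y_true = [0] * 9 + [2] * 9 + [12] * 2
y_pred = [0] * 9 + [2] * 9 + [0] * 2

acc = accuracy(y_true, y_pred)            # 18 of 20 rows correct
bal = balanced_accuracy(y_true, y_pred)   # recalls are 1, 1 and 0
print(acc, bal)
```

Accuracy stays high because the rare class contributes few rows, while balanced accuracy drops because that class has zero recall.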

In [8]:
predictor.leaderboard(test_data)

,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,0.9504,0.962963,accuracy,0.347248,0.086445,10.904956,0.00367,0.000365,0.058467,2,True,14
1,LightGBM,0.9456,0.955956,accuracy,0.089059,0.014918,8.945264,0.089059,0.014918,8.945264,1,True,5
2,XGBoost,0.9448,0.956957,accuracy,0.193522,0.041134,6.242307,0.193522,0.041134,6.242307,1,True,11
3,LightGBMLarge,0.9444,0.94995,accuracy,0.243649,0.044126,21.454955,0.243649,0.044126,21.454955,1,True,13
4,CatBoost,0.9432,0.955956,accuracy,0.031643,0.002922,13.002642,0.031643,0.002922,13.002642,1,True,8
5,RandomForestEntr,0.9382,0.946947,accuracy,0.090051,0.037654,0.956485,0.090051,0.037654,0.956485,1,True,7
6,NeuralNetFastAI,0.9356,0.93994,accuracy,0.029684,0.005541,4.089429,0.029684,0.005541,4.089429,1,True,3
7,RandomForestGini,0.9354,0.943944,accuracy,0.105055,0.038966,0.77258,0.105055,0.038966,0.77258,1,True,6
8,NeuralNetTorch,0.9352,0.947948,accuracy,0.019988,0.006221,80.552412,0.019988,0.006221,80.552412,1,True,12
9,ExtraTreesEntr,0.935,0.944945,accuracy,0.118281,0.036866,0.57352,0.118281,0.036866,0.57352,1,True,10
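
`WeightedEnsemble_L2` sits at stack level 2 and combines the predictions of the level-1 models listed in the leaderboard. The sketch below illustrates only the blending step, a weighted average of class probabilities. The probabilities and weights are invented for illustration, and how AutoGluon actually selects the weights is not shown here.

```python
def weighted_ensemble(prob_by_model, weights):
    """Blend per-model class probabilities using weights that sum to 1."""
    n_classes = len(next(iter(prob_by_model.values())))
    blended = [0.0] * n_classes
    for name, weight in weights.items():
        for k, p in enumerate(prob_by_model[name]):
            blended[k] += weight * p
    return blended


# Hypothetical class probabilities for a single row from three base models.
probs = {
    "LightGBM": [0.7, 0.2, 0.1],
    "XGBoost": [0.5, 0.4, 0.1],
    "CatBoost": [0.2, 0.6, 0.2],
}
# Hypothetical ensemble weights.
weights = {"LightGBM": 0.5, "XGBoost": 0.25, "CatBoost": 0.25}

blended = weighted_ensemble(probs, weights)
predicted_class = max(range(len(blended)), key=blended.__getitem__)
print(blended, predicted_class)
```

Because the ensemble needs the base models' outputs first, its `pred_time_test` includes their prediction time, while `pred_time_test_marginal` counts only the blending step.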


---
# Basic Features
- AutoGluon does not require manual data preprocessing such as missing-value imputation or one-hot encoding.

## Load Data

In [11]:
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
subsample_size = 500  # subsample the data for a faster demo; try much larger values
train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head()

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv | Columns = 15 / 15 | Rows = 39073 -> 39073


,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
6118,51,Private,39264,Some-college,10,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,>50K
23204,58,Private,51662,10th,6,Married-civ-spouse,Other-service,Wife,White,Female,0,0,8,United-States,<=50K
29590,40,Private,326310,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,44,United-States,<=50K
18116,37,Private,222450,HS-grad,9,Never-married,Sales,Not-in-family,White,Male,0,2339,40,El-Salvador,<=50K
33964,62,Private,109190,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,15024,0,40,United-States,>50K


In [12]:
label = 'class'
print(f"Unique classes: {list(train_data[label].unique())}")

Unique classes: [' >50K', ' <=50K']


## Training

In [13]:
predictor = TabularPredictor(label=label).fit(train_data)

No path specified. Models will be saved in: "AutogluonModels/ag-20250111_075946"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.2
Python Version:     3.12.7
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 24.2.0: Fri Dec  6 19:01:59 PST 2024; root:xnu-11215.61.5~2/RELEASE_ARM64_T6000
CPU Count:          8
Memory Avail:       4.46 GB / 16.00 GB (27.9%)
Disk Space Avail:   322.42 GB / 926.35 GB (34.8%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and

In [27]:
print("AutoGluon inferred problem type:", predictor.problem_type)
print("AutoGluon identified the following types of features:")
print(predictor.feature_metadata)

AutoGluon inferred problem type: binary
AutoGluon identified the following types of features:
('category', [])  : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
('int', [])       : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('int', ['bool']) : 1 | ['sex']
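
The output above shows that the problem type was inferred from the label column alone. The sketch below is a deliberately simplified heuristic for this kind of inference. It is an assumption made for illustration, not AutoGluon's actual rule, and both `guess_problem_type` and the `max_classes` threshold are hypothetical.

```python
def guess_problem_type(labels, max_classes=20):
    """Guess 'binary', 'multiclass' or 'regression' from raw label values."""
    unique = set(labels)
    if len(unique) == 2:
        return "binary"

    def is_number(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    def has_fraction(v):
        return isinstance(v, float) and not v.is_integer()

    if all(is_number(v) for v in unique):
        # Many distinct values or non-integer values look like regression.
        if len(unique) > max_classes or any(has_fraction(v) for v in unique):
            return "regression"
    return "multiclass"


print(guess_problem_type([" >50K", " <=50K", " <=50K"]))  # binary
print(guess_problem_type([-2, 0, 2, -8, 4]))              # multiclass
print(guess_problem_type([0.5, 1.25, 3.0]))               # regression
```

When the guess would be wrong for a given dataset, the problem type can be stated explicitly through the `problem_type` argument of `TabularPredictor`.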


In [28]:
# test_data is the income test set loaded in cell In [14] (cells were run out of document order)
test_data_transform = predictor.transform_features(test_data)
test_data_transform.head()

,age,fnlwgt,education-num,sex,capital-gain,capital-loss,hours-per-week,workclass,education,marital-status,occupation,relationship,race,native-country
0,31,169085,7,0,0,0,20,3,1,1,10,5,4,14
1,17,226203,8,1,0,0,45,5,2,3,10,3,4,14
2,47,54260,11,1,0,1887,60,3,7,1,3,0,4,14
3,21,176262,10,0,0,0,30,3,13,3,3,3,4,14
4,17,241185,8,1,0,0,20,3,2,3,8,3,4,14
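
In the transformed table, categorical columns such as `workclass` and `education` appear as integer codes instead of strings. The sketch below shows one simple way to get such codes, a plain ordinal encoding with a reserved code for unseen values. It is an illustrative assumption rather than AutoGluon's feature generator, so the codes it produces will not match the ones above, and the example values are chosen only for demonstration.

```python
def fit_category_codes(values):
    """Assign each distinct training value an integer code, in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def encode(values, codes, unknown=-1):
    """Map values to their codes; values never seen in training get `unknown`."""
    return [codes.get(v, unknown) for v in values]


# Illustrative training values for a categorical column.
train_workclass = ["Private", "Self-emp-not-inc", "Private", "State-gov"]
codes = fit_category_codes(train_workclass)

encoded = encode(["Private", "State-gov", "Never-worked"], codes)
print(encoded)  # [0, 2, -1]
```

The mapping is learned from the training data only, which is why new data must go through the same fitted transformation before prediction.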


In [29]:
predictor.feature_importance(test_data)

Computing feature importance via permutation shuffling for 14 features using 5000 rows with 5 shuffle sets...
	2.52s	= Expected runtime (0.5s per shuffle set)
	1.11s	= Actual runtime (Completed 5 of 5 shuffle sets)


,importance,stddev,p_value,n,p99_high,p99_low
marital-status,0.0508,0.003792,3.698489e-06,5,0.058608,0.042992
capital-gain,0.03852,0.002318,1.565361e-06,5,0.043292,0.033748
education-num,0.02968,0.001346,5.063512e-07,5,0.032452,0.026908
age,0.015,0.00285,0.000149044,5,0.020867,0.009133
hours-per-week,0.01172,0.003974,0.00136943,5,0.019902,0.003538
occupation,0.00528,0.001803,0.001406849,5,0.008993,0.001567
relationship,0.00472,0.001154,0.0003967984,5,0.007096,0.002344
native-country,0.00144,0.000654,0.003959537,5,0.002787,9.3e-05
capital-loss,0.00128,0.000415,0.001155921,5,0.002134,0.000426
fnlwgt,0.00108,0.002361,0.1820562,5,0.00594,-0.00378
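
The log above says importance is computed "via permutation shuffling" with 5 shuffle sets. The sketch below shows the basic idea on a toy model: shuffle one feature column, re-score, and average the drop in accuracy. The helper name, the toy model and the data are all made up for illustration, and AutoGluon's own routine additionally reports `stddev`, `p_value` and confidence bounds that are not reproduced here.

```python
import random


def permutation_importance(predict, rows, y_true, feature, n_shuffles=5, seed=0):
    """Average drop in accuracy after shuffling a single feature column."""
    rng = random.Random(seed)

    def score(data):
        preds = [predict(r) for r in data]
        return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)

    baseline = score(rows)
    drops = []
    for _ in range(n_shuffles):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        # Copy each row, replacing only the shuffled feature.
        shuffled = [{**r, feature: v} for r, v in zip(rows, column)]
        drops.append(baseline - score(shuffled))
    return sum(drops) / n_shuffles


# Toy data: the label depends only on the sign of "x"; "noise" is irrelevant.
xs = list(range(-10, 0)) + list(range(1, 11))
rows = [{"x": x, "noise": (x * 7) % 3} for x in xs]
y_true = [x > 0 for x in xs]


def toy_model(row):
    """Stand-in for a trained model: predicts from "x" alone."""
    return row["x"] > 0


imp_x = permutation_importance(toy_model, rows, y_true, "x")
imp_noise = permutation_importance(toy_model, rows, y_true, "noise")
print(imp_x, imp_noise)
```

A feature the model ignores gets an importance of exactly zero here, which mirrors how `fnlwgt` ends up near zero with a large `p_value` in the table above.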


## Predictions from Individual Models

In [31]:
predictor.model_best  # returns the best model

'WeightedEnsemble_L2'

In [34]:
predictor.model_names()

['KNeighborsUnif',
 'KNeighborsDist',
 'LightGBMXT',
 'LightGBM',
 'RandomForestGini',
 'RandomForestEntr',
 'CatBoost',
 'ExtraTreesGini',
 'ExtraTreesEntr',
 'NeuralNetFastAI',
 'XGBoost',
 'NeuralNetTorch',
 'LightGBMLarge',
 'WeightedEnsemble_L2']

In [35]:
predictor.predict(test_data, model='XGBoost')

0        <=50K
1        <=50K
2         >50K
3        <=50K
4        <=50K
         ...  
9764     <=50K
9765     <=50K
9766     <=50K
9767     <=50K
9768     <=50K
Name: class, Length: 9769, dtype: object

In [32]:
predictor.predict(test_data, model='LightGBM')

0        <=50K
1        <=50K
2         >50K
3        <=50K
4        <=50K
         ...  
9764     <=50K
9765     <=50K
9766     <=50K
9767     <=50K
9768     <=50K
Name: class, Length: 9769, dtype: object
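
The two cells above request predictions from one model at a time. A small sketch for collecting predictions from several models and measuring how often two of them agree is given below. It relies only on `predictor.model_names()` and `predictor.predict(data, model=...)` as used in this notebook; the helper names `predictions_by_model` and `agreement_rate` are hypothetical.

```python
def predictions_by_model(predictor, data, models=None):
    """Return {model_name: predictions} for each requested model."""
    names = models if models is not None else predictor.model_names()
    return {name: predictor.predict(data, model=name) for name in names}


def agreement_rate(preds_a, preds_b):
    """Fraction of rows on which two models give the same prediction."""
    pairs = list(zip(preds_a, preds_b))
    return sum(a == b for a, b in pairs) / len(pairs)
```

For example, `preds = predictions_by_model(predictor, test_data, models=['XGBoost', 'LightGBM'])` followed by `agreement_rate(preds['XGBoost'], preds['LightGBM'])` gives a quick sense of how much the two models differ.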

## Load Test Data

In [14]:
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
test_data.head()

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,31,Private,169085,11th,7,Married-civ-spouse,Sales,Wife,White,Female,0,0,20,United-States,<=50K
1,17,Self-emp-not-inc,226203,12th,8,Never-married,Sales,Own-child,White,Male,0,0,45,United-States,<=50K
2,47,Private,54260,Assoc-voc,11,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,1887,60,United-States,>50K
3,21,Private,176262,Some-college,10,Never-married,Exec-managerial,Own-child,White,Female,0,0,30,United-States,<=50K
4,17,Private,241185,12th,8,Never-married,Prof-specialty,Own-child,White,Male,0,0,20,United-States,<=50K


## Prediction

In [15]:
y_pred = predictor.predict(test_data)
y_pred.head()  # Predictions

0     <=50K
1     <=50K
2      >50K
3     <=50K
4     <=50K
Name: class, dtype: object

In [16]:
y_pred_proba = predictor.predict_proba(test_data)
y_pred_proba.head()  # Prediction Probabilities

,<=50K,>50K
0,0.981126,0.018874
1,0.983599,0.016401
2,0.478133,0.521867
3,0.994751,0.005249
4,0.988539,0.011461
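
Class probabilities make it possible to choose a decision threshold other than 0.5. The sketch below applies a threshold to the `>50K` probabilities from the five rows above. `apply_threshold` is a hypothetical helper, and the label strings are simplified: the raw labels in this dataset carry a leading space, as cell In [12] shows.

```python
def apply_threshold(positive_proba, threshold=0.5, positive=">50K", negative="<=50K"):
    """Label a row positive when its positive-class probability reaches the threshold."""
    return [positive if p >= threshold else negative for p in positive_proba]


# Positive-class (>50K) probabilities copied from the first five rows above.
proba = [0.018874, 0.016401, 0.521867, 0.005249, 0.011461]

default_labels = apply_threshold(proba)
strict_labels = apply_threshold(proba, threshold=0.6)
print(default_labels)  # ['<=50K', '<=50K', '>50K', '<=50K', '<=50K']
print(strict_labels)   # ['<=50K', '<=50K', '<=50K', '<=50K', '<=50K']
```

Row 2, with a probability of 0.521867, flips from `>50K` to `<=50K` once the threshold rises to 0.6, which trades recall for precision.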


## Evaluation

In [17]:
predictor.evaluate(test_data)

{'accuracy': 0.8409253761899887,
 'balanced_accuracy': 0.7475663839529563,
 'mcc': 0.5345297121913682,
 'roc_auc': 0.884716037791454,
 'f1': 0.6296472831267874,
 'precision': 0.7034078807241747,
 'recall': 0.5698878343399483}
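
The binary metrics above are related to one another through the confusion matrix. The sketch below uses the standard definitions to compute precision, recall, F1 and MCC from confusion counts, and checks that the reported `f1` follows from the reported `precision` and `recall`. The confusion counts are made up for illustration, since the notebook does not show the real ones.

```python
import math


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and Matthews correlation from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
        "mcc": mcc,
    }


# F1 recomputed from the precision and recall reported above.
reported_f1 = f1_score(0.7034078807241747, 0.5698878343399483)
print(round(reported_f1, 4))  # 0.6296

# Hypothetical confusion counts, for illustration only.
toy = binary_metrics(tp=40, fp=10, fn=20, tn=30)
print(toy)
```

The recall of 0.57 against a precision of 0.70 indicates that the model misses a fair share of `>50K` rows, which is also why `balanced_accuracy` is well below `accuracy`.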

In [19]:
predictor.leaderboard(test_data)

,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,CatBoost,0.842461,0.85,accuracy,0.0213,0.00192,0.647981,0.0213,0.00192,0.647981,1,True,7
1,RandomForestGini,0.842461,0.84,accuracy,0.074294,0.015497,0.249086,0.074294,0.015497,0.249086,1,True,5
2,XGBoost,0.840925,0.86,accuracy,0.034128,0.003138,0.300834,0.034128,0.003138,0.300834,1,True,11
3,WeightedEnsemble_L2,0.840925,0.86,accuracy,0.03709,0.003398,0.324949,0.002962,0.00026,0.024115,2,True,14
4,RandomForestEntr,0.840925,0.83,accuracy,0.05692,0.01474,0.283375,0.05692,0.01474,0.283375,1,True,6
5,LightGBM,0.839799,0.85,accuracy,0.012339,0.001971,0.419642,0.012339,0.001971,0.419642,1,True,4
6,NeuralNetTorch,0.837138,0.83,accuracy,0.042548,0.004347,0.660227,0.042548,0.004347,0.660227,1,True,12
7,LightGBMXT,0.836421,0.83,accuracy,0.01346,0.001537,0.277095,0.01346,0.001537,0.277095,1,True,3
8,ExtraTreesGini,0.834374,0.82,accuracy,0.056257,0.026017,0.246801,0.056257,0.026017,0.246801,1,True,8
9,ExtraTreesEntr,0.832839,0.81,accuracy,0.06391,0.02653,0.228151,0.06391,0.02653,0.228151,1,True,9


## Load the Model

In [21]:
predictor.path  # path to directory containing all models

'/Users/rui/Code/Astronote/34_autogluon/AutogluonModels/ag-20250111_075946'

In [26]:
predictor = TabularPredictor.load(predictor.path)
predictor.leaderboard(test_data)

,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,CatBoost,0.842461,0.85,accuracy,0.004747,0.00192,0.647981,0.004747,0.00192,0.647981,1,True,7
1,RandomForestGini,0.842461,0.84,accuracy,0.060447,0.015497,0.249086,0.060447,0.015497,0.249086,1,True,5
2,XGBoost,0.840925,0.86,accuracy,0.026112,0.003138,0.300834,0.026112,0.003138,0.300834,1,True,11
3,WeightedEnsemble_L2,0.840925,0.86,accuracy,0.027289,0.003398,0.324949,0.001177,0.00026,0.024115,2,True,14
4,RandomForestEntr,0.840925,0.83,accuracy,0.055627,0.01474,0.283375,0.055627,0.01474,0.283375,1,True,6
5,LightGBM,0.839799,0.85,accuracy,0.009563,0.001971,0.419642,0.009563,0.001971,0.419642,1,True,4
6,NeuralNetTorch,0.837138,0.83,accuracy,0.039182,0.004347,0.660227,0.039182,0.004347,0.660227,1,True,12
7,LightGBMXT,0.836421,0.83,accuracy,0.006117,0.001537,0.277095,0.006117,0.001537,0.277095,1,True,3
8,ExtraTreesGini,0.834374,0.82,accuracy,0.060073,0.026017,0.246801,0.060073,0.026017,0.246801,1,True,8
9,ExtraTreesEntr,0.832839,0.81,accuracy,0.062098,0.02653,0.228151,0.062098,0.02653,0.228151,1,True,9
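
The reloaded predictor reports the same scores as before, with only the timing columns changing. Since the default save directory is timestamped, it can be convenient to pick the directory explicitly. The sketch below assumes the `path` argument of `TabularPredictor`; `fit_and_reload` and the directory name are hypothetical.

```python
def fit_and_reload(train_data, label, path="AutogluonModels/income_demo"):
    """Train into a fixed directory, then load the predictor back from disk."""
    # Imported lazily so that defining this helper does not itself require AutoGluon.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=label, path=path).fit(train_data)
    # predictor.path is the directory that contains all trained models.
    return TabularPredictor.load(predictor.path)
```

In a later session only `TabularPredictor.load(path)` is needed, provided the directory still exists.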
