# Quick Start

In this tutorial, we will see how to use AutoGluon's TabularPredictor to predict the values of a target column based on other columns in the dataset. This quick start covers AutoGluon’s basic fit and predict functionality using TabularDataset and TabularPredictor. AutoGluon simplifies the model training process by not requiring manual feature engineering or model hyperparameter tuning.

Start by ensuring that AutoGluon is installed, then import AutoGluon's TabularDataset and TabularPredictor. We will use the former to load data and the latter to train models and make predictions.


In [1]:

try:
    from autogluon.tabular import TabularPredictor, TabularDataset
except ImportError:
    !python -m pip install --upgrade pip
    !python -m pip install autogluon
    from autogluon.tabular import TabularPredictor, TabularDataset

Collecting pip
  Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 9.5 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.1.1
Collecting autogluon
  Downloading autogluon-1.3.1-py3-none-any.whl.metadata (11 kB)
Collecting autogluon.core==1.3.1 (from autogluon.core[all]==1.3.1->autogluon)
  Downloading autogluon.core-1.3.1-py3-none-any.whl.metadata (12 kB)
Collecting autogluon.features==1.3.1 (from autogluon)
  Downloading autogluon.features-1.3.1-py3-none-any.whl.metadata (11 kB)
Collecting autogluon.tabular==1.3.1 (from autogluon.tabular[all]==1.3.1->autogluon)
  Downloading autogluon.tabular-1.3.1-py3-none-any.whl.metadata (14 kB)
Collecting autogluon.multim

## Example Data

For this tutorial we will use a dataset from the cover story of Nature issue 7887: AI-guided intuition for math theorems. The goal is to predict a knot’s signature based on its properties. We sampled 10K training and 5K test examples from the original data. The sampled dataset makes this tutorial run quickly, but AutoGluon can handle the full dataset if desired.

We load this dataset directly from a URL. AutoGluon’s TabularDataset is a subclass of pandas DataFrame, so any DataFrame methods can be used on TabularDataset as well.

In [2]:
data_url = 'https://raw.githubusercontent.com/mli/ag-docs/main/knot_theory/'
train_data = TabularDataset(f'{data_url}train.csv')
train_data.head()

Unnamed: 0.1,Unnamed: 0,chern_simons,cusp_volume,hyperbolic_adjoint_torsion_degree,hyperbolic_torsion_degree,injectivity_radius,longitudinal_translation,meridinal_translation_imag,meridinal_translation_real,short_geodesic_imag_part,short_geodesic_real_part,Symmetry_0,Symmetry_D3,Symmetry_D4,Symmetry_D6,Symmetry_D8,Symmetry_Z/2 + Z/2,volume,signature
0,70746,0.09053,12.226322,0,10,0.507756,10.685555,1.144192,-0.519157,-2.760601,1.015512,0.0,0.0,0.0,0.0,0.0,1.0,11.393225,-2
1,240827,0.232453,13.800773,0,14,0.413645,10.453156,1.320249,-0.158522,-3.013258,0.827289,0.0,0.0,0.0,0.0,0.0,1.0,12.742782,0
2,155659,-0.144099,14.76103,0,14,0.436928,13.405199,1.101142,0.768894,2.233106,0.873856,0.0,0.0,0.0,0.0,0.0,0.0,15.236505,2
3,239963,-0.171668,13.738019,0,22,0.249481,27.819496,0.493827,-1.188718,-2.042771,0.498961,0.0,0.0,0.0,0.0,0.0,0.0,17.27989,-8
4,90504,0.235188,15.896359,0,10,0.389329,15.330971,1.036879,0.722828,-3.056138,0.778658,0.0,0.0,0.0,0.0,0.0,0.0,16.749298,4


Our targets are stored in the “signature” column, which has 18 unique integers. Although pandas reads this column as a numeric type rather than as a categorical one, AutoGluon will infer that the values should be treated as classes.

In [3]:
label = 'signature'
train_data[label].describe()

Unnamed: 0,signature
count,10000.0
mean,-0.022
std,3.025166
min,-12.0
25%,-2.0
50%,0.0
75%,2.0
max,12.0


## Training
Now we create a TabularPredictor by specifying the label column name, and then train it on the dataset using TabularPredictor.fit(). We don't need to specify any other parameters. AutoGluon will understand that this is a multi-class classification task, perform automatic feature engineering, train multiple models, and then combine the models to create the final predictor.

In [4]:
predictor = TabularPredictor(label=label).fit(train_data)

No path specified. Models will be saved in: "AutogluonModels/ag-20250621_091756"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.3.1
Python Version:     3.11.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Sun Mar 30 16:01:29 UTC 2025
CPU Count:          2
Memory Avail:       11.33 GB / 12.67 GB (89.4%)
Disk Space Avail:   183.31 GB / 225.83 GB (81.2%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='high'         : Strong 

Model fitting should take a few minutes or less depending on your CPU. You can make training faster by specifying the time_limit argument. For example, fit(..., time_limit=60) will stop training after 60 seconds. Higher time limits will generally result in better prediction performance, and excessively low time limits will prevent AutoGluon from training and ensembling a reasonable set of models.

## Prediction
Once we have a predictor that is fit on the training dataset, we can load a separate set of data to use for prediction and evaluation.

In [5]:
test_data = TabularDataset(f'{data_url}test.csv')

y_pred = predictor.predict(test_data.drop(columns=[label]))
y_pred.head()

Loaded data from: https://raw.githubusercontent.com/mli/ag-docs/main/knot_theory/test.csv | Columns = 19 / 19 | Rows = 5000 -> 5000


Unnamed: 0,signature
0,-4
1,-2
2,0
3,4
4,2


In [6]:
y_pred_proba = predictor.predict_proba(test_data)
y_pred_proba.head()  # Prediction Probabilities

Unnamed: 0,-12,-10,-8,-6,-4,-2,0,2,4,6,8,10,12
0,0.0,0.0,8.891126e-07,0.010039,0.886592,0.035863,0.033334,0.033334,0.000834,3e-06,9.809706e-07,0.0,0.0
1,0.0,0.0,1.404176e-05,0.000843,0.043468,0.450109,0.395773,0.107244,0.00168,0.000847,2.121599e-05,0.0,0.0
2,0.0,0.0,2.683037e-07,0.000833,0.041667,0.074213,0.834091,0.046695,0.001667,0.000833,1.453546e-07,0.0,0.0
3,0.0,0.0,9.395511e-07,0.000836,0.067507,0.066667,0.033333,0.004222,0.814425,0.013007,2.668471e-06,0.0,0.0
4,0.0,0.0,1.536764e-06,1e-06,0.066668,0.035848,0.089408,0.771809,0.002928,3e-06,0.03333408,0.0,0.0
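
To connect the two outputs above: for a multi-class problem, the predicted label is normally the class with the highest predicted probability. The short sketch below checks this on the first two rows of the probability table, copied by hand. The helper name `most_likely_class` is made up for illustration and is not part of AutoGluon.

```python
# Class labels and the first two probability rows, copied from the output above.
classes = [-12, -10, -8, -6, -4, -2, 0, 2, 4, 6, 8, 10, 12]
proba_rows = [
    [0.0, 0.0, 8.891126e-07, 0.010039, 0.886592, 0.035863, 0.033334,
     0.033334, 0.000834, 3e-06, 9.809706e-07, 0.0, 0.0],
    [0.0, 0.0, 1.404176e-05, 0.000843, 0.043468, 0.450109, 0.395773,
     0.107244, 0.00168, 0.000847, 2.121599e-05, 0.0, 0.0],
]


def most_likely_class(row, class_labels):
    """Return the label whose probability is highest in this row (hypothetical helper)."""
    best_index = max(range(len(row)), key=lambda i: row[i])
    return class_labels[best_index]


predicted = [most_likely_class(row, classes) for row in proba_rows]
print(predicted)  # [-4, -2], matching the first two values of y_pred
```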


## Evaluation
We can evaluate the predictor on the test dataset using the evaluate() function, which measures how well our predictor performs on data that was not used for fitting the models.

In [8]:
predictor.evaluate(test_data, silent=True)

{'accuracy': 0.9478,
 'balanced_accuracy': np.float64(0.754478262473782),
 'mcc': np.float64(0.9360368834449522)}
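
The gap between accuracy and balanced_accuracy in the output above is typical when some classes are rare. The sketch below uses made-up toy labels (not the knot data) to show why. Accuracy counts every row equally, while balanced accuracy averages the recall of each class, so a rare class that is often missed pulls it down.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of rows whose prediction matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Toy labels (illustrative only): class 0 is common, class 12 is rare.
y_true = [0] * 8 + [12] * 2
y_pred = [0] * 8 + [0, 12]  # one of the two rare rows is misclassified

print(accuracy(y_true, y_pred))           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
```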

AutoGluon’s TabularPredictor also provides the leaderboard() function, which allows us to evaluate the performance of each individual trained model on the test data.

In [9]:
predictor.leaderboard(test_data)

Unnamed: 0,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,0.9478,0.964965,accuracy,3.526904,0.677187,107.066849,0.010747,0.001042,0.129622,2,True,14
1,LightGBM,0.9456,0.955956,accuracy,0.83771,0.14151,8.291406,0.83771,0.14151,8.291406,1,True,5
2,XGBoost,0.9448,0.956957,accuracy,2.002593,0.402028,13.941863,2.002593,0.402028,13.941863,1,True,11
3,LightGBMLarge,0.9444,0.94995,accuracy,2.154734,0.524598,14.659295,2.154734,0.524598,14.659295,1,True,13
4,CatBoost,0.9432,0.955956,accuracy,0.097386,0.013515,66.888672,0.097386,0.013515,66.888672,1,True,8
5,RandomForestEntr,0.9384,0.94995,accuracy,0.570432,0.106184,9.603348,0.570432,0.106184,9.603348,1,True,7
6,NeuralNetFastAI,0.9364,0.940941,accuracy,0.077106,0.025255,10.104249,0.077106,0.025255,10.104249,1,True,3
7,ExtraTreesGini,0.936,0.946947,accuracy,0.789398,0.115544,2.445461,0.789398,0.115544,2.445461,1,True,9
8,ExtraTreesEntr,0.9358,0.942943,accuracy,0.994625,0.137522,2.463628,0.994625,0.137522,2.463628,1,True,10
9,RandomForestGini,0.9352,0.944945,accuracy,0.41432,0.105955,6.265554,0.41432,0.105955,6.265554,1,True,6


# How It Works

AutoML is usually associated with Hyperparameter Optimization (HPO) of one or multiple models, or with the automatic selection of models based on the given problem. In fact, most AutoML frameworks do this.

AutoGluon is different because it doesn’t rely on HPO to achieve great performance[1]. It’s based on three main principles: (1) training a variety of different models, (2) using bagging when training those models, and (3) stack-ensembling those models to combine their predictive power into a “super” model.

## Bagging
Bagging (Bootstrap Aggregation) is a technique used in Machine Learning to improve the stability and accuracy of algorithms. The key idea is that combining multiple models usually leads to better performance than any single model because it reduces overfitting and adds robustness to the prediction.

In general, the bootstrap portion of bagging involves taking many random sub-samples (with replacement, i.e. the same data point can appear more than once in a sample) from the training dataset, and then training a different model on each bootstrapped sample.
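
As a minimal illustration of sampling with replacement, the sketch below draws three bootstrap samples from ten row indices. The function name `bootstrap_sample` and the seed are arbitrary choices for this example.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]


rng = random.Random(0)  # fixed seed so the demo is repeatable
rows = list(range(10))  # stand-in for ten training rows

samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for sample in samples:
    # Repeated indices show rows drawn more than once; missing indices were left out.
    print(sorted(sample))
```

Each sample has the same size as the original data, but usually contains some rows several times and omits others. That variation is what makes the models trained on them differ.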

However, AutoGluon performs bagging in a different way by combining it with cross-validation. In addition to the benefits of bagging, cross-validation allows us to train and validate multiple models using all the training data. This also increases our confidence in the scores of the trained models.

**Partitioning:** The training data is partitioned into K folds (or subsets of the dataset).

**Model Training:** For each of the folds, a model is trained using all the data except the fold. This means we train K separate model instances with different portions of the data. This is known as a bagged model.

**Cross-validation:** Each model instance is evaluated against the hold-out fold that wasn’t used during training. We then concatenate the predictions[3] from the folds to create the out-of-fold (OOF) predictions. We calculate the final model cross-validation score by computing the evaluation metric using the OOF predictions and the target ground truth. Make sure to form a solid understanding of what out-of-fold (OOF) predictions are, as they are the most critical component to making stack ensembling work (see below).

**Aggregation:** At prediction time, bagging takes all these individual models and averages their predictions to generate a final answer (e.g. the class in the case of classification problems).
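
The four steps above can be sketched in plain Python. This is a simplified illustration, not AutoGluon's implementation. `MeanModel` is a deliberately trivial stand-in learner that ignores features and predicts the mean of its training targets. The round-robin fold assignment is also an assumption made to keep the example short.

```python
def make_folds(n_rows, k):
    """Partitioning: assign row indices to k folds in round-robin order."""
    return [list(range(fold, n_rows, k)) for fold in range(k)]


class MeanModel:
    """Stand-in learner: always predicts the mean of its training targets."""

    def fit(self, targets):
        self.mean_ = sum(targets) / len(targets)
        return self

    def predict(self, n_rows):
        return [self.mean_] * n_rows


def fit_bagged_model(targets, k):
    """Train one model per fold and collect out-of-fold (OOF) predictions."""
    oof = [None] * len(targets)
    models = []
    for holdout in make_folds(len(targets), k):
        holdout_set = set(holdout)
        # Model training: use every row except the hold-out fold.
        train_targets = [t for i, t in enumerate(targets) if i not in holdout_set]
        model = MeanModel().fit(train_targets)
        models.append(model)
        # Cross-validation: each row is predicted by the model that never saw it.
        for i, pred in zip(holdout, model.predict(len(holdout))):
            oof[i] = pred
    return models, oof


def predict_bagged(models, n_rows):
    """Aggregation: average the predictions of all fold models."""
    per_model = [m.predict(n_rows) for m in models]
    return [sum(col) / len(models) for col in zip(*per_model)]


targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
models, oof = fit_bagged_model(targets, k=3)
cv_mae = sum(abs(p - t) for p, t in zip(oof, targets)) / len(targets)

print(oof)                        # [4.0, 3.5, 3.0, 4.0, 3.5, 3.0]
print(cv_mae)                     # 1.5
print(predict_bagged(models, 2))  # [3.5, 3.5]
```

Note that every training row receives exactly one OOF prediction, and that the cross-validation score is computed once over the concatenated OOF predictions rather than per fold.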

## Stacked Ensembling
In the most general sense, ensembling is another technique used in Machine Learning to improve the accuracy of predictions by combining the strengths of multiple models.

AutoGluon, in particular, uses stack ensembling. At a high level, stack ensembling is a technique that leverages the predictions of models as extra features in the data.

**Layer(s) of Models:** Stacked ensembling is like a multi-layer cake. Each layer consists of several different bagged models (see above) that use the predictions from the previous layer as inputs (features) in addition to the original features from the training data (akin to a skip connection). The first layer (also known as the base) uses only the original features from the training data.

**Weighted Ensemble:** The last layer consists of a single “super” model that combines the predictions from the second to last layer[5]. The job of this model, commonly known as the meta-model, is to learn how to combine the outputs from all previous models to make a final prediction. Think of this model as a leader who makes a final decision by weighting everyone else’s inputs. In fact, that is exactly what AutoGluon does: it uses a Greedy Weighted Ensemble algorithm to produce the final meta-model.
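
To give a feel for what a greedy weighted ensemble can look like, here is a minimal sketch that assumes greedy forward selection with replacement: at each round, the model whose addition most reduces validation error is added, and the final weight of a model is the fraction of rounds in which it was picked. The text above only names the algorithm, so the details here (squared error as the loss, the number of rounds, the function names) are assumptions and may differ from AutoGluon's actual implementation.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def greedy_ensemble_weights(model_preds, target, n_rounds=4):
    """Greedy forward selection with replacement over validation predictions."""
    names = list(model_preds)
    counts = {name: 0 for name in names}
    running_sum = [0.0] * len(target)
    for round_index in range(1, n_rounds + 1):
        best_name, best_loss = None, float("inf")
        for name in names:
            # Ensemble prediction if this model were added in this round.
            candidate = [(s + p) / round_index
                         for s, p in zip(running_sum, model_preds[name])]
            loss = mse(candidate, target)
            if loss < best_loss:
                best_name, best_loss = name, loss
        counts[best_name] += 1
        running_sum = [s + p for s, p in zip(running_sum, model_preds[best_name])]
    return {name: counts[name] / n_rounds for name in names}


# Toy validation predictions (illustrative only).
target = [1.0, 2.0, 3.0]
model_preds = {
    "A": [1.5, 2.5, 3.5],  # consistently 0.5 too high
    "B": [0.5, 1.5, 2.5],  # consistently 0.5 too low
    "C": [3.0, 3.0, 3.0],  # ignores the input entirely
}

weights = greedy_ensemble_weights(model_preds, target)
print(weights)  # {'A': 0.5, 'B': 0.5, 'C': 0.0}
```

In this toy case the errors of A and B cancel, so the procedure splits the weight between them and gives the uninformative model C a weight of zero. When fitting such weights in a stacked ensemble, the inputs would be out-of-fold predictions rather than predictions on data the models were trained on.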

**Residual Connections:** Note that the structure of stacked ensembling resembles that of a Neural Network. Therefore, advanced techniques used for Neural Networks (e.g. dropout, skip connections, etc.) could also be applied here.

**How to Train:** During training time it is critical to avoid data leakage, and therefore we use the out-of-fold (OOF) predictions of each bagged model instead of predicting directly on the train data with the bagged model. By using out-of-fold predictions, we ensure that each instance in the training dataset has a corresponding prediction that was generated by a model that did not train on that instance. This setup mirrors how the final ensemble model will operate on new, unseen data.

**How to Infer:** During inference time, we don’t need to worry about data leakage, and we simply average the predictions of all models in a bag to generate its predictions.
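
The difference between training and inference can be sketched as follows. The inputs to the next layer are the original features plus one prediction column per base model. At training time those columns are OOF predictions, and at inference time they are the average over each bag's fold models. All numbers and helper names below are made up for illustration.

```python
def build_stack_features(original_rows, layer_predictions):
    """Append each base model's prediction to the original feature row."""
    stacked = []
    for row_index, row in enumerate(original_rows):
        extra = [preds[row_index] for preds in layer_predictions]
        stacked.append(list(row) + extra)
    return stacked


def average_bag_predictions(per_fold_predictions):
    """Average the predictions of all fold models in one bag."""
    n_models = len(per_fold_predictions)
    return [sum(col) / n_models for col in zip(*per_fold_predictions)]


# Training time: use out-of-fold predictions (toy values) to avoid leakage.
train_rows = [[0.1, 1.0], [0.2, 0.0], [0.3, 1.0]]
oof_model_a = [0.9, 0.2, 0.7]
oof_model_b = [0.8, 0.1, 0.6]
train_stack = build_stack_features(train_rows, [oof_model_a, oof_model_b])

# Inference time: every fold model predicts the new row, then we average.
new_rows = [[0.4, 1.0]]
model_a_fold_preds = [[0.5], [0.75]]  # two fold models of bag A
model_b_fold_preds = [[0.25], [0.5]]  # two fold models of bag B
new_stack = build_stack_features(
    new_rows,
    [average_bag_predictions(model_a_fold_preds),
     average_bag_predictions(model_b_fold_preds)],
)

print(train_stack[0])  # [0.1, 1.0, 0.9, 0.8]
print(new_stack)       # [[0.4, 1.0, 0.625, 0.375]]
```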


# Advanced Usage

This section describes how you can exert greater control when using AutoGluon’s fit() or predict(). Recall that to maximize predictive performance, you should first try TabularPredictor() and fit() with all default arguments. Then, consider non-default arguments for TabularPredictor(eval_metric=...), and fit(presets=...). Later, you can experiment with other arguments to fit() covered in this in-depth tutorial like hyperparameter_tune_kwargs, hyperparameters, num_stack_levels, num_bag_folds, num_bag_sets, etc.

## Data Load

In [10]:
from autogluon.tabular import TabularDataset, TabularPredictor

import numpy as np

train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
subsample_size = 1000  # subsample subset of data for faster demo, try setting this to much larger values
train_data = train_data.sample(n=subsample_size, random_state=0)
print(train_data.head())

label = 'occupation'
print("Summary of occupation column: \n", train_data['occupation'].describe())

test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
y_test = test_data[label]
test_data_nolabel = test_data.drop(columns=[label])  # delete label column

metric = 'accuracy' # we specify eval-metric just for demo (unnecessary as it's the default)

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv | Columns = 15 / 15 | Rows = 39073 -> 39073


       age workclass  fnlwgt      education  education-num  \
6118    51   Private   39264   Some-college             10   
23204   58   Private   51662           10th              6   
29590   40   Private  326310   Some-college             10   
18116   37   Private  222450        HS-grad              9   
33964   62   Private  109190      Bachelors             13   

            marital-status        occupation    relationship    race      sex  \
6118    Married-civ-spouse   Exec-managerial            Wife   White   Female   
23204   Married-civ-spouse     Other-service            Wife   White   Female   
29590   Married-civ-spouse      Craft-repair         Husband   White     Male   
18116        Never-married             Sales   Not-in-family   White     Male   
33964   Married-civ-spouse   Exec-managerial         Husband   White     Male   

       capital-gain  capital-loss  hours-per-week  native-country   class  
6118              0             0              40   United-State

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769


## Specifying hyperparameters and tuning them

We first demonstrate hyperparameter tuning and how you can provide your own validation dataset, which AutoGluon internally relies on to tune hyperparameters, early-stop iterative training, and construct model ensembles. One reason you may specify validation data is when future test data will stem from a different distribution than training data (and your specified validation data is more representative of the future data that will likely be encountered).

If you don’t have a strong reason to provide your own validation dataset, we recommend you omit the tuning_data argument. This lets AutoGluon automatically select validation data from your provided training set (it uses smart strategies such as stratified sampling). For greater control, you can specify the holdout_frac argument to tell AutoGluon what fraction of the provided training data to hold out for validation.
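
To illustrate what a stratified holdout means, the sketch below holds out the same fraction of rows from every class, so the validation set keeps the class proportions of the training data. It is a generic illustration with hypothetical names, not AutoGluon's splitting code, and its rounding rule (at least one validation row per class) is an assumption.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, holdout_frac, seed=0):
    """Split row indices into train/validation, holding out holdout_frac of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Assumption: keep at least one validation row per class.
        n_val = max(1, round(len(indices) * holdout_frac))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy labels (illustrative only): 80 rows of class "a", 20 rows of class "b".
labels = ["a"] * 80 + ["b"] * 20
train_idx, val_idx = stratified_holdout(labels, holdout_frac=0.2)

val_labels = [labels[i] for i in val_idx]
print(len(train_idx), len(val_idx))                # 80 20
print(val_labels.count("a"), val_labels.count("b"))  # 16 4
```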

fit() trains neural networks and various types of tree ensembles by default. You can specify various hyperparameter values for each type of model. For each hyperparameter, you can either specify a single fixed value, or a search space of values to consider during hyperparameter optimization. Hyperparameters which you do not specify are left at default settings chosen automatically by AutoGluon, which may be fixed values or search spaces.
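
The distinction between a fixed value and a search space can be made concrete with a small random-search sketch. The search space is written as plain dictionaries rather than AutoGluon `space` objects, and `sample_value` and `sample_configs` are hypothetical helpers. It is meant only to show how one configuration per trial could be drawn, including log-scale sampling for a learning rate.

```python
import math
import random

# Hypothetical search-space description (plain dicts, not AutoGluon objects).
SEARCH_SPACE = {
    "num_epochs": 10,  # fixed value: used as-is in every trial
    "learning_rate": {"type": "real", "lower": 1e-4, "upper": 1e-2, "log": True},
    "activation": {"type": "categorical", "choices": ["relu", "softrelu", "tanh"]},
    "dropout_prob": {"type": "real", "lower": 0.0, "upper": 0.5, "log": False},
}


def sample_value(spec, rng):
    """Return a fixed value unchanged, or draw one value from a search space."""
    if not isinstance(spec, dict):
        return spec
    kind = spec["type"]
    if kind == "real":
        if spec["log"]:
            # Sample uniformly in log space so each order of magnitude is equally likely.
            log_value = rng.uniform(math.log(spec["lower"]), math.log(spec["upper"]))
            return math.exp(log_value)
        return rng.uniform(spec["lower"], spec["upper"])
    if kind == "int":
        return rng.randint(spec["lower"], spec["upper"])
    if kind == "categorical":
        return rng.choice(spec["choices"])
    raise ValueError(f"unknown search-space type: {kind}")


def sample_configs(search_space, num_trials, seed=0):
    """Draw num_trials independent configurations (plain random search)."""
    rng = random.Random(seed)
    return [
        {name: sample_value(spec, rng) for name, spec in search_space.items()}
        for _ in range(num_trials)
    ]


configs = sample_configs(SEARCH_SPACE, num_trials=5)
for config in configs:
    print(config)
```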

In [11]:
from autogluon.common import space

nn_options = {  # specifies non-default hyperparameter values for neural network models
    'num_epochs': 10,  # number of training epochs (controls training time of NN models)
    'learning_rate': space.Real(1e-4, 1e-2, default=5e-4, log=True),  # learning rate used in training (real-valued hyperparameter searched on log-scale)
    'activation': space.Categorical('relu', 'softrelu', 'tanh'),  # activation function used in NN (categorical hyperparameter, default = first entry)
    'dropout_prob': space.Real(0.0, 0.5, default=0.1),  # dropout probability (real-valued hyperparameter)
}

gbm_options = {  # specifies non-default hyperparameter values for lightGBM gradient boosted trees
    'num_boost_round': 100,  # number of boosting rounds (controls training time of GBM models)
    'num_leaves': space.Int(lower=26, upper=66, default=36),  # number of leaves in trees (integer hyperparameter)
}

hyperparameters = {  # hyperparameters of each model type
                   'GBM': gbm_options,
                   'NN_TORCH': nn_options,  # NOTE: comment this line out if you get errors on Mac OSX
                  }  # When these keys are missing from hyperparameters dict, no models of that type are trained

time_limit = 2*60  # train various models for ~2 min
num_trials = 5  # try at most 5 different hyperparameter configurations for each type of model
search_strategy = 'auto'  # to tune hyperparameters using random search routine with a local scheduler

hyperparameter_tune_kwargs = {  # HPO is not performed unless hyperparameter_tune_kwargs is specified
    'num_trials': num_trials,
    'scheduler' : 'local',
    'searcher': search_strategy,
}  # Refer to TabularPredictor.fit docstring for all valid values

predictor = TabularPredictor(label=label, eval_metric=metric).fit(
    train_data,
    time_limit=time_limit,
    hyperparameters=hyperparameters,
    hyperparameter_tune_kwargs=hyperparameter_tune_kwargs,
)

No path specified. Models will be saved in: "AutogluonModels/ag-20250621_093421"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.3.1
Python Version:     3.11.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Sun Mar 30 16:01:29 UTC 2025
CPU Count:          2
Memory Avail:       10.52 GB / 12.67 GB (83.0%)
Disk Space Avail:   182.81 GB / 225.83 GB (81.0%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='high'         : Strong 

  0%|          | 0/5 [00:00<?, ?it/s]

Fitted model: LightGBM/T1 ...
	0.37	 = Validation score   (accuracy)
	0.79s	 = Training   runtime
	0.01s	 = Validation runtime
Fitted model: LightGBM/T2 ...
	0.355	 = Validation score   (accuracy)
	0.75s	 = Training   runtime
	0.01s	 = Validation runtime
Fitted model: LightGBM/T3 ...
	0.375	 = Validation score   (accuracy)
	0.56s	 = Training   runtime
	0.01s	 = Validation runtime
Fitted model: LightGBM/T4 ...
	0.36	 = Validation score   (accuracy)
	0.79s	 = Training   runtime
	0.03s	 = Validation runtime
Fitted model: LightGBM/T5 ...
	0.375	 = Validation score   (accuracy)
	0.72s	 = Training   runtime
	0.01s	 = Validation runtime
Hyperparameter tuning model: NeuralNetTorch ... Tuning model for up to 53.95s of the 115.8s of remaining time.


+---------------------------------------------------+
| Configuration for experiment     NeuralNetTorch   |
+---------------------------------------------------+
| Search algorithm                 SearchGenerator  |
| Scheduler                        FIFOScheduler    |
| Number of trials                 5                |
+---------------------------------------------------+

View detailed results here: /content/AutogluonModels/ag-20250621_093421/models/NeuralNetTorch


Fitted model: NeuralNetTorch/0d794a3a ...
	0.365	 = Validation score   (accuracy)
	5.04s	 = Training   runtime
	0.02s	 = Validation runtime
Fitted model: NeuralNetTorch/93b3e11e ...
	0.315	 = Validation score   (accuracy)
	6.72s	 = Training   runtime
	0.01s	 = Validation runtime
Fitted model: NeuralNetTorch/d45725bc ...
	0.32	 = Validation score   (accuracy)
	6.28s	 = Training   runtime
	0.02s	 = Validation runtime





Fitting model: WeightedEnsemble_L2 ... Training model for up to 119.88s of the 57.17s of remaining time.
	Ensemble Weights: {'LightGBM/T3': 1.0}
	0.375	 = Validation score   (accuracy)
	0.03s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 62.9s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 20169.3 rows/s (200 batch size)
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/content/AutogluonModels/ag-20250621_093421")


We again demonstrate how to use the trained models to predict on the test data.

In [12]:
y_pred = predictor.predict(test_data_nolabel)
print("Predictions:  ", list(y_pred)[:5])
perf = predictor.evaluate(test_data, auxiliary_metrics=False)

Predictions:   [' Other-service', ' Craft-repair', ' Exec-managerial', ' Sales', ' Other-service']


Use the following to view a summary of what happened during fit(). Now this command will show details of the hyperparameter-tuning process for each type of model:

In [13]:
results = predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                     model  score_val eval_metric  pred_time_val  fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0              LightGBM/T3      0.375    accuracy       0.008905  0.555402                0.008905           0.555402            1       True          3
1      WeightedEnsemble_L2      0.375    accuracy       0.009916  0.590175                0.001011           0.034773            2       True          9
2              LightGBM/T5      0.375    accuracy       0.013941  0.719250                0.013941           0.719250            1       True          5
3              LightGBM/T1      0.370    accuracy       0.005238  0.791662                0.005238           0.791662            1       True          1
4  NeuralNetTorch/0d794a3a      0.365    accuracy       0.017779  5.043689                0.017779           5.043689            1       True          6
5              Light

In the above example, the predictive performance may be poor because we specified very little training to ensure quick runtimes. You can call fit() multiple times while modifying the above settings to better understand how these choices affect performance outcomes. For example: you can increase subsample_size to train using a larger dataset, increase the num_epochs and num_boost_round hyperparameters, and increase the time_limit (which you should do for all code in these tutorials). To see more detailed output during the execution of fit(), you can also pass in the argument: verbosity=3.

## Model ensembling with stacking/bagging
Beyond hyperparameter-tuning with a correctly-specified evaluation metric, two other methods to boost predictive performance are bagging and stack-ensembling. **You’ll often see performance improve if you specify num_bag_folds = 5-10, num_stack_levels = 1 in the call to fit()**, but this will increase training times and memory/disk usage.

In [14]:
label = 'class'  # Now let's predict the "class" column (binary classification)
test_data_nolabel = test_data.drop(columns=[label])
y_test = test_data[label]
save_path = 'agModels-predictClass'  # folder where to store trained models

predictor = TabularPredictor(label=label, eval_metric=metric).fit(train_data,
    num_bag_folds=5, num_bag_sets=1, num_stack_levels=1,
    hyperparameters = {'NN_TORCH': {'num_epochs': 2}, 'GBM': {'num_boost_round': 20}},  # last  argument is just for quick demo here, omit it in real applications
)

No path specified. Models will be saved in: "AutogluonModels/ag-20250621_094249"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.3.1
Python Version:     3.11.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Sun Mar 30 16:01:29 UTC 2025
CPU Count:          2
Memory Avail:       10.19 GB / 12.67 GB (80.4%)
Disk Space Avail:   182.79 GB / 225.83 GB (80.9%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='high'         : Strong 

You should not provide tuning_data when stacking/bagging, and instead provide all your available data as train_data (which AutoGluon will split in more intelligent ways). num_bag_sets controls how many times the k-fold bagging process is repeated to further reduce variance (increasing this may further boost accuracy but will substantially increase training times, inference latency, and memory/disk usage). Rather than manually searching for good bagging/stacking values yourself, AutoGluon will automatically select good values for you if you specify auto_stack instead (which is used in the best_quality preset):

In [15]:
# Let's also specify the "balanced_accuracy" metric
predictor = TabularPredictor(label=label, eval_metric='balanced_accuracy', path=save_path).fit(
    train_data, auto_stack=True,
    calibrate_decision_threshold=False,  # Disabling for demonstration in next section
    hyperparameters={'FASTAI': {'num_epochs': 10}, 'GBM': {'num_boost_round': 200}}  # last 2 arguments are for quick demo, omit them in real applications
)
predictor.leaderboard(test_data)

Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.3.1
Python Version:     3.11.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Sun Mar 30 16:01:29 UTC 2025
CPU Count:          2
Memory Avail:       10.18 GB / 12.67 GB (80.4%)
Disk Space Avail:   182.78 GB / 225.83 GB (80.9%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='high'         : Strong accuracy with fast inference speed.
	presets='good'         : Good accuracy with 

Unnamed: 0,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,LightGBM_BAG_L1,0.743784,0.776399,balanced_accuracy,1.118749,0.043838,40.150996,1.118749,0.043838,40.150996,1,True,1
1,WeightedEnsemble_L2,0.743784,0.776399,balanced_accuracy,1.124171,0.045162,40.207769,0.005423,0.001325,0.056773,2,True,3
2,NeuralNetFastAI_BAG_L1,0.724629,0.741368,balanced_accuracy,2.222456,0.103267,62.386574,2.222456,0.103267,62.386574,1,True,2


Stacking/bagging will often produce higher accuracy than hyperparameter tuning alone, but you may try combining both techniques (**note: specifying presets='best_quality' in fit() simply sets auto_stack=True**).

## Decision Threshold Calibration
Major metric score improvements can be achieved in binary classification for metrics such as "f1" and "balanced_accuracy" by adjusting the prediction decision threshold via calibrate_decision_threshold to a value other than the default 0.5.

Below is an example of the "balanced_accuracy" score achieved on the test data with and without calibrating the decision threshold:

In [16]:
print(f'Prior to calibration (predictor.decision_threshold={predictor.decision_threshold}):')
scores = predictor.evaluate(test_data)

calibrated_decision_threshold = predictor.calibrate_decision_threshold()
predictor.set_decision_threshold(calibrated_decision_threshold)

print(f'After calibration (predictor.decision_threshold={predictor.decision_threshold}):')
scores_calibrated = predictor.evaluate(test_data)

Prior to calibration (predictor.decision_threshold=0.5):


Calibrating decision threshold to optimize metric balanced_accuracy | Checking 51 thresholds...
Calibrating decision threshold via fine-grained search | Checking 38 thresholds...
	Base Threshold: 0.500	| val: 0.7764
	Best Threshold: 0.250	| val: 0.7926
Updating predictor.decision_threshold from 0.5 -> 0.25
	This will impact how prediction probabilities are converted to predictions in binary classification.
	Prediction probabilities of the positive class >0.25 will be predicted as the positive class ( >50K). This can significantly impact metric scores.
	You can update this value via `predictor.set_decision_threshold`.
	You can calculate an optimal decision threshold on the validation data via `predictor.calibrate_decision_threshold()`.


After calibration (predictor.decision_threshold=0.25):


In [17]:
for metric_name in scores:
    metric_score = scores[metric_name]
    metric_score_calibrated = scores_calibrated[metric_name]
    decision_threshold = predictor.decision_threshold
    print(f'decision_threshold={decision_threshold:.3f}\t| metric="{metric_name}"'
          f'\n\ttest_score uncalibrated: {metric_score:.4f}'
          f'\n\ttest_score   calibrated: {metric_score_calibrated:.4f}'
          f'\n\ttest_score        delta: {metric_score_calibrated-metric_score:.4f}')

decision_threshold=0.250	| metric="balanced_accuracy"
	test_score uncalibrated: 0.7438
	test_score   calibrated: 0.8120
	test_score        delta: 0.0682
decision_threshold=0.250	| metric="accuracy"
	test_score uncalibrated: 0.8472
	test_score   calibrated: 0.8162
	test_score        delta: -0.0310
decision_threshold=0.250	| metric="mcc"
	test_score uncalibrated: 0.5457
	test_score   calibrated: 0.5654
	test_score        delta: 0.0197
decision_threshold=0.250	| metric="roc_auc"
	test_score uncalibrated: 0.8990
	test_score   calibrated: 0.8990
	test_score        delta: 0.0000
decision_threshold=0.250	| metric="f1"
	test_score uncalibrated: 0.6294
	test_score   calibrated: 0.6749
	test_score        delta: 0.0454
decision_threshold=0.250	| metric="precision"
	test_score uncalibrated: 0.7411
	test_score   calibrated: 0.5814
	test_score        delta: -0.1597
decision_threshold=0.250	| metric="recall"
	test_score uncalibrated: 0.5470
	test_score   calibrated: 0.8041
	test_score        delta: 0.2571

Notice that calibrating for “balanced_accuracy” substantially improved the “balanced_accuracy” score, but it lowered the “accuracy” and “precision” scores. Threshold calibration often trades off performance across metrics, so keep this in mind when choosing which metric to calibrate for.
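
The tradeoff between metrics can be reproduced on a small made-up example. The sketch below is plain Python and is not AutoGluon's implementation; the labels, probabilities, and helper names (`confusion`, `accuracy`, `balanced_accuracy`) are assumptions chosen only for illustration.

```python
# Toy data (hypothetical): 7 negatives and 3 positives with predicted
# probabilities of the positive class.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
proba_pos = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.30, 0.40, 0.80]


def confusion(y_true, proba_pos, threshold):
    """Return (tp, fp, tn, fn), predicting positive when probability > threshold."""
    tp = fp = tn = fn = 0
    for label, p in zip(y_true, proba_pos):
        pred = 1 if p > threshold else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def balanced_accuracy(tp, fp, tn, fn):
    # Mean of the recall on the positive class and the recall on the negative class.
    return (tp / (tp + fn) + tn / (tn + fp)) / 2


results = {}
for threshold in (0.5, 0.25):
    counts = confusion(y_true, proba_pos, threshold)
    results[threshold] = (accuracy(*counts), balanced_accuracy(*counts))
    print(f"threshold={threshold:.2f}  accuracy={results[threshold][0]:.4f}  "
          f"balanced_accuracy={results[threshold][1]:.4f}")
```

On this toy data, lowering the threshold from 0.5 to 0.25 raises balanced accuracy while lowering plain accuracy, because more positives are recovered at the cost of extra false positives.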

Instead of calibrating for “balanced_accuracy” specifically, we can calibrate for any metric if we want to maximize the score of that metric:

In [18]:
predictor.set_decision_threshold(0.5)  # Reset decision threshold
for metric_name in ['f1', 'balanced_accuracy', 'mcc']:
    metric_score = predictor.evaluate(test_data, silent=True)[metric_name]
    calibrated_decision_threshold = predictor.calibrate_decision_threshold(metric=metric_name, verbose=False)
    metric_score_calibrated = predictor.evaluate(
        test_data, decision_threshold=calibrated_decision_threshold, silent=True
    )[metric_name]
    print(f'decision_threshold={calibrated_decision_threshold:.3f}\t| metric="{metric_name}"'
          f'\n\ttest_score uncalibrated: {metric_score:.4f}'
          f'\n\ttest_score   calibrated: {metric_score_calibrated:.4f}'
          f'\n\ttest_score        delta: {metric_score_calibrated-metric_score:.4f}')

Updating predictor.decision_threshold from 0.25 -> 0.5
	This will impact how prediction probabilities are converted to predictions in binary classification.
	Prediction probabilities of the positive class >0.5 will be predicted as the positive class ( >50K). This can significantly impact metric scores.
	You can update this value via `predictor.set_decision_threshold`.
	You can calculate an optimal decision threshold on the validation data via `predictor.calibrate_decision_threshold()`.


decision_threshold=0.500	| metric="f1"
	test_score uncalibrated: 0.6294
	test_score   calibrated: 0.6294
	test_score        delta: 0.0000
decision_threshold=0.250	| metric="balanced_accuracy"
	test_score uncalibrated: 0.7438
	test_score   calibrated: 0.8120
	test_score        delta: 0.0682
decision_threshold=0.500	| metric="mcc"
	test_score uncalibrated: 0.5457
	test_score   calibrated: 0.5457
	test_score        delta: 0.0000


Instead of calibrating the decision threshold post-fit, you can have it **automatically occur during the fit call by specifying the fit parameter predictor.fit(..., calibrate_decision_threshold=True).**

By default, AutoGluon already applies decision threshold calibration when beneficial, because the default value is calibrate_decision_threshold="auto". We recommend keeping this default in most cases.

## Prediction options (inference)
Even if you’ve started a new Python session since last calling fit(), you can still load a previously trained predictor from disk:

In [19]:
predictor = TabularPredictor.load(save_path)  # `predictor.path` is another way to get the relative path needed to later load predictor.

Above, save_path is the same folder previously passed to TabularPredictor, in which all the trained models have been saved. You can easily train models on one machine and deploy them on another. **Simply copy the save_path folder to the new machine and specify its new path in TabularPredictor.load().**

To find out the required feature columns to make predictions, call predictor.features():

In [20]:
predictor.features()

['age',
 'workclass',
 'fnlwgt',
 'education',
 'education-num',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'capital-gain',
 'capital-loss',
 'hours-per-week',
 'native-country']

We can make a prediction on an individual example rather than a full dataset:

In [21]:
datapoint = test_data_nolabel.iloc[[0]]  # Note: .iloc[0] won't work because it returns pandas Series instead of DataFrame
print(datapoint)
predictor.predict(datapoint)

   age workclass  fnlwgt education  education-num       marital-status  \
0   31   Private  169085      11th              7   Married-civ-spouse   

  occupation relationship    race      sex  capital-gain  capital-loss  \
0      Sales         Wife   White   Female             0             0   

   hours-per-week  native-country  
0              20   United-States  


Unnamed: 0,class
0,<=50K


To output predicted class probabilities instead of predicted classes, you can use:

In [22]:
predictor.predict_proba(datapoint)  # returns a DataFrame that shows which probability corresponds to which class

Unnamed: 0,<=50K,>50K
0,0.951059,0.048941


By default, predict() and predict_proba() will utilize the model that AutoGluon thinks is most accurate, which is usually an ensemble of many individual models. Here’s how to see which model this is:

In [23]:
predictor.model_best

'WeightedEnsemble_L2'

We can instead specify a particular model to use for predictions (e.g. to reduce inference latency). Note that a ‘model’ in AutoGluon may refer to, for example, a single Neural Network, a bagged ensemble of many Neural Network copies trained on different training/validation splits, a weighted ensemble that aggregates the predictions of many other models, or a stacker model that operates on predictions output by other models. This is akin to viewing a Random Forest as one ‘model’ when it is in fact an ensemble of many decision trees.
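
To make the idea of a weighted ensemble concrete, here is a minimal pure-Python sketch. It is only an illustration: the model names, probabilities, and weights are hypothetical, and AutoGluon's own ensembling logic is more involved.

```python
# Hypothetical base-model outputs: probability of the positive class for three rows.
base_predictions = {
    "model_a": [0.20, 0.70, 0.55],
    "model_b": [0.10, 0.90, 0.35],
}
# Hypothetical ensemble weights (non-negative and summing to 1).
weights = {"model_a": 0.25, "model_b": 0.75}


def weighted_ensemble(base_predictions, weights):
    """Combine per-model probabilities into one weighted average per row."""
    n_rows = len(next(iter(base_predictions.values())))
    return [
        sum(weights[name] * preds[i] for name, preds in base_predictions.items())
        for i in range(n_rows)
    ]


ensemble_proba = weighted_ensemble(base_predictions, weights)
print(ensemble_proba)
```

The ensemble is treated as a single ‘model’ even though producing its prediction requires first running every base model it depends on.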

Before deciding which model to use, let’s evaluate all of the models AutoGluon has previously trained on our test data:

In [24]:
predictor.leaderboard(test_data)

Unnamed: 0,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,LightGBM_BAG_L1,0.743784,0.776399,balanced_accuracy,0.799699,0.043838,40.150996,0.799699,0.043838,40.150996,1,True,1
1,WeightedEnsemble_L2,0.743784,0.776399,balanced_accuracy,0.802311,0.045162,40.207769,0.002612,0.001325,0.056773,2,True,3
2,NeuralNetFastAI_BAG_L1,0.724629,0.741368,balanced_accuracy,1.713701,0.103267,62.386574,1.713701,0.103267,62.386574,1,True,2


The leaderboard shows each model’s predictive performance on the test data (score_test) and validation data (score_val), as well as the time required to: produce predictions for the test data (pred_time_test), produce predictions on the validation data (pred_time_val), and train this model together with any models it depends on (fit_time). The corresponding *_marginal columns report the time for the model alone, excluding the models it depends on. Below, we show that a leaderboard can be produced without new data (just uses the data previously reserved for validation inside fit) and can display extra information about each model:

In [25]:
predictor.leaderboard(extra_info=True)

Unnamed: 0,model,score_val,eval_metric,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order,...,hyperparameters,hyperparameters_fit,ag_args_fit,features,compile_time,child_hyperparameters,child_hyperparameters_fit,child_ag_args_fit,ancestors,descendants
0,LightGBM_BAG_L1,0.776399,balanced_accuracy,0.043838,40.150996,0.043838,40.150996,1,True,1,...,"{'use_orig_features': True, 'valid_stacker': True, 'max_base_models': 0, 'max_base_models_per_type': 'auto', 'save_bag_folds': True, 'stratify': 'auto', 'bin': 'auto', 'n_bins': None}",{},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': None, 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None, 'drop_unique': False}","[native-country, sex, race, hours-per-week, relationship, marital-status, fnlwgt, capital-gain, education-num, capital-loss, workclass, occupation, age, education]",,"{'learning_rate': 0.05, 'num_boost_round': 200}",{'num_boost_round': 83},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': ['bool', 'int', 'float', 'category'], 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None}",[],[WeightedEnsemble_L2]
1,WeightedEnsemble_L2,0.776399,balanced_accuracy,0.045162,40.207769,0.001325,0.056773,2,True,3,...,"{'use_orig_features': False, 'valid_stacker': True, 'max_base_models': 0, 'max_base_models_per_type': 'auto', 'save_bag_folds': True, 'stratify': 'auto', 'bin': 'auto', 'n_bins': None}",{},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': None, 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None, 'drop_unique': False}",[LightGBM_BAG_L1],,"{'ensemble_size': 25, 'subsample_size': 1000000}",{'ensemble_size': 1},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': None, 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None, 'drop_unique': False}",[LightGBM_BAG_L1],[]
2,NeuralNetFastAI_BAG_L1,0.741368,balanced_accuracy,0.103267,62.386574,0.103267,62.386574,1,True,2,...,"{'use_orig_features': True, 'valid_stacker': True, 'max_base_models': 0, 'max_base_models_per_type': 'auto', 'save_bag_folds': True, 'stratify': 'auto', 'bin': 'auto', 'n_bins': None}",{},"{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': None, 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None, 'drop_unique': False}","[native-country, sex, race, hours-per-week, relationship, marital-status, fnlwgt, capital-gain, education-num, capital-loss, workclass, occupation, age, education]",,"{'layers': None, 'emb_drop': 0.1, 'ps': 0.1, 'bs': 'auto', 'lr': 0.01, 'epochs': 'auto', 'early.stopping.min_delta': 0.0001, 'early.stopping.patience': 20, 'smoothing': 0.0, 'num_epochs': 10}","{'epochs': 30, 'best_epoch': 9}","{'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': ['bool', 'int', 'float', 'category'], 'valid_special_types': None, 'ignored_type_group_special': ['text_ngram', 'text_as_category'], 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None}",[],[]


The expanded leaderboard shows properties like how many features are used by each model (num_features), which other models are ancestors whose predictions are required inputs for each model (ancestors), and how much memory each model and all its ancestors would occupy if simultaneously persisted (memory_size_w_ancestors).

To show scores for other metrics, you can specify the extra_metrics argument when passing in test_data:

In [26]:
predictor.leaderboard(test_data, extra_metrics=['accuracy', 'balanced_accuracy', 'log_loss'])

Unnamed: 0,model,score_test,accuracy,balanced_accuracy,log_loss,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,LightGBM_BAG_L1,0.743784,0.84717,0.743784,-0.334022,0.776399,balanced_accuracy,0.816168,0.043838,40.150996,0.816168,0.043838,40.150996,1,True,1
1,WeightedEnsemble_L2,0.743784,0.84717,0.743784,-0.334022,0.776399,balanced_accuracy,0.817812,0.045162,40.207769,0.001644,0.001325,0.056773,2,True,3
2,NeuralNetFastAI_BAG_L1,0.724629,0.843792,0.724629,-0.343404,0.741368,balanced_accuracy,1.273144,0.103267,62.386574,1.273144,0.103267,62.386574,1,True,2


Notice that log_loss scores are negative. This is because metrics in AutoGluon are always shown in higher_is_better form. This means that metrics such as log_loss and root_mean_squared_error will have their signs FLIPPED, and values will be negative. This way, you do not need to know each metric’s convention to tell whether higher is better when reading the leaderboard.

**One additional caveat:** It is possible that log_loss values can be -inf when computed via extra_metrics. This is because the models were not optimized with log_loss in mind during training and may output prediction probabilities that assign a class a probability of exactly 0 (particularly common with K-Nearest-Neighbors models). Because log_loss gives infinite error when the correct class was given 0 probability, this results in a score of -inf. It is therefore recommended not to use log_loss as a secondary metric to determine model quality. Either use log_loss as the eval_metric or avoid it altogether.
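
Both the sign flip and the -inf case can be seen in a few lines of plain Python. This is a minimal sketch, not AutoGluon's metric code; the helper name `mean_log_loss` and the probabilities are made up for illustration.

```python
import math


def mean_log_loss(proba_of_true_class):
    """Average negative log of the probability assigned to the correct class."""
    total = 0.0
    for p in proba_of_true_class:
        # log(0) is undefined; the loss is treated as infinite in that case.
        total += math.inf if p == 0 else -math.log(p)
    return total / len(proba_of_true_class)


# Probability each (hypothetical) model gave to the correct class on three rows.
reasonable_model = [0.9, 0.8, 0.95]
model_with_a_zero = [0.9, 0.8, 0.0]

# Flip the sign so that higher is better, as on the leaderboard.
score_ok = -mean_log_loss(reasonable_model)
score_inf = -mean_log_loss(model_with_a_zero)

print(score_ok)   # a small negative number
print(score_inf)  # -inf
```

A single row where the correct class receives probability 0 is enough to drive the whole score to -inf, regardless of how good the other predictions are.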

Here’s how to specify a particular model to use for prediction instead of AutoGluon’s default model-choice:

In [27]:
i = 0  # index of model to use
model_to_use = predictor.model_names()[i]
model_pred = predictor.predict(datapoint, model=model_to_use)
print("Prediction from %s model: %s" % (model_to_use, model_pred.iloc[0]))

Prediction from LightGBM_BAG_L1 model:  <=50K


We can easily access various information about the trained predictor or a particular model:

In [28]:
all_models = predictor.model_names()
model_to_use = all_models[i]
specific_model = predictor._trainer.load_model(model_to_use)

# Objects defined below are dicts of various information (not printed here as they are quite large):
model_info = specific_model.get_info()
predictor_information = predictor.info()

The predictor also remembers what metric predictions should be evaluated with, which can be done with ground truth labels as follows:

In [29]:
y_pred_proba = predictor.predict_proba(test_data_nolabel)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred_proba)
perf

{'balanced_accuracy': np.float64(0.7437840946238461),
 'accuracy': 0.847169618179957,
 'mcc': np.float64(0.5457085071194441),
 'roc_auc': np.float64(0.8989524774398953),
 'f1': 0.6294365847604865,
 'precision': 0.7410870835768556,
 'recall': 0.5470232959447799}

Since the label column remains in the test_data DataFrame, we can instead use the shorthand:

In [30]:
perf = predictor.evaluate(test_data)
perf

{'balanced_accuracy': np.float64(0.7437840946238461),
 'accuracy': 0.847169618179957,
 'mcc': np.float64(0.5457085071194441),
 'roc_auc': np.float64(0.8989524774398953),
 'f1': 0.6294365847604865,
 'precision': 0.7410870835768556,
 'recall': 0.5470232959447799}

## Feature Importance

To better understand our trained predictor, we can estimate the overall importance of each feature:

In [31]:
predictor.feature_importance(test_data)

Computing feature importance via permutation shuffling for 14 features using 5000 rows with 5 shuffle sets...
	35.19s	= Expected runtime (7.04s per shuffle set)
	34.45s	= Actual runtime (Completed 5 of 5 shuffle sets)


Unnamed: 0,importance,stddev,p_value,n,p99_high,p99_low
marital-status,0.068704,0.004542,2.279366e-06,5,0.078057,0.059352
capital-gain,0.046431,0.002457,9.369035e-07,5,0.051489,0.041372
education-num,0.042721,0.003485,5.268617e-06,5,0.049898,0.035545
age,0.035115,0.005922,9.348413e-05,5,0.047308,0.022922
occupation,0.033699,0.00789,0.0003356604,5,0.049945,0.017454
relationship,0.014965,0.003663,0.0003983866,5,0.022507,0.007423
hours-per-week,0.01227,0.00375,0.0009287608,5,0.019992,0.004548
capital-loss,0.002217,0.00126,0.008531892,5,0.004812,-0.000378
education,0.000319,0.000774,0.2045314,5,0.001912,-0.001274
native-country,0.0,0.0,0.5,5,0.0,0.0


Computed via permutation-shuffling, these feature importance scores quantify the drop in predictive performance (of the already trained predictor) when one column’s values are randomly shuffled across rows. The top features in this list contribute most to AutoGluon’s accuracy (here, for predicting whether the class is >50K). **Features with non-positive importance score hardly contribute to the predictor’s accuracy, or may even be actively harmful to include in the data (consider removing these features from your data and calling fit again).** These scores facilitate interpretability of the predictor’s global behavior (which features it relies on for all predictions).
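
The permutation-shuffling idea can be sketched in plain Python. The example below is an illustration under simplifying assumptions, not AutoGluon's implementation: `toy_model` is a hypothetical stand-in for a trained predictor, the data is synthetic, and accuracy is used as the metric.

```python
import random


def toy_model(row):
    """Hypothetical 'trained' model: uses only feature 0 and ignores feature 1."""
    return 1 if row[0] > 0.5 else 0


def accuracy(model, X, y):
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, feature_idx, num_shuffle_sets=5, seed=0):
    """Average drop in accuracy after shuffling one column across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    drops = []
    for _ in range(num_shuffle_sets):
        column = [row[feature_idx] for row in X]
        rng.shuffle(column)
        X_shuffled = [
            row[:feature_idx] + [value] + row[feature_idx + 1:]
            for row, value in zip(X, column)
        ]
        drops.append(baseline - accuracy(model, X_shuffled, y))
    return sum(drops) / len(drops)


# Synthetic data: the label depends only on feature 0; feature 1 is pure noise.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [1 if row[0] > 0.5 else 0 for row in X]

importance_used = permutation_importance(toy_model, X, y, feature_idx=0)
importance_ignored = permutation_importance(toy_model, X, y, feature_idx=1)
print(importance_used, importance_ignored)
```

Shuffling the feature the model relies on causes a large drop in accuracy, while shuffling the ignored feature changes nothing and yields an importance of exactly 0, similar to native-country in the table above.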

## Feature Engineering

Feature engineering involves taking raw tabular data and converting it into a format that machine learning models can read, often enhancing some columns (‘features’ in ML jargon) to give the models more information in the hope of getting more accurate results.

AutoGluon does some of this for you. By default a feature generator called AutoMLPipelineFeatureGenerator is used. Let’s see this in action. We’ll create a dataframe containing a floating point column, an integer column, a datetime column, a categorical column, and a text column. We’ll first take a look at the raw data we created.

In [33]:
from autogluon.tabular import TabularDataset, TabularPredictor
import pandas as pd
import numpy as np
import random
from sklearn.datasets import make_regression
from datetime import datetime

x, y = make_regression(n_samples=100, n_features=5, n_targets=1, random_state=1)
dfx = pd.DataFrame(x, columns=['A','B','C','D','E'])
dfy = pd.DataFrame(y, columns=['label'])

# Create an integer column, a datetime column, a categorical column and a string column to demonstrate how they are processed.
dfx['B'] = (dfx['B']).astype(int)
dfx['C'] = datetime(2000,1,1) + pd.to_timedelta(dfx['C'].astype(int), unit='D')
dfx['D'] = pd.cut(dfx['D'] * 10, [-np.inf,-5,0,5,np.inf],labels=['v','w','x','y'])
dfx['E'] = pd.Series(list(' '.join(random.choice(["abc", "d", "ef", "ghi", "jkl"]) for i in range(4)) for j in range(100)))
dataset=TabularDataset(dfx)
print(dfx)

           A  B          C  D                E
0  -0.545774  0 2000-01-01  y    ghi ef ef abc
1  -0.468674  0 2000-01-02  x    ef abc abc ef
2   1.767960  0 1999-12-31  v      abc ef ef d
3  -0.118771  1 2000-01-01  y     abc d ef abc
4   0.630196  0 1999-12-31  w      jkl d d abc
..       ... ..        ... ..              ...
95 -1.182318 -1 2000-01-01  v       d ef ef ef
96  0.562761  0 2000-01-01  v   ghi abc jkl ef
97 -0.797270  0 2000-01-01  w     jkl jkl d ef
98  0.502741  0 1999-12-31  y  ghi jkl jkl ghi
99  2.056356  0 1999-12-30  w      jkl ef ef d

[100 rows x 5 columns]


Now let’s call the default feature generator AutoMLPipelineFeatureGenerator with no parameters and see what it does.

In [35]:
from autogluon.features.generators import AutoMLPipelineFeatureGenerator
auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()
auto_ml_pipeline_feature_generator.fit_transform(X=dfx)

Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    10379.52 MB
	Train Data (Original)  Memory Usage: 0.01 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
		Fitting DatetimeFeatureGenerator...
		Fitting TextSpecialFeatureGenerator...
			Fitting BinnedFeatureGenerator...
			Fitting DropDuplicatesFeatureGenerator...
		Fitting TextNgramFeatureGenerator...
			Fitting CountVectorizer for text features: ['E']
			CountVectorizer fit with vocabulary size = 4
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Stage 5 Generators:
		Fitting DropDuplicatesFeatureGenerator...
	Types o

Unnamed: 0,A,B,D,E,C,C.year,C.month,C.day,C.dayofweek,E.char_count,E.symbol_ratio.,__nlp__.abc,__nlp__.ef,__nlp__.ghi,__nlp__.jkl,__nlp__._total_
0,-0.545774,0,3,,946684800000000000,2000,1,1,5,4,2,1,2,1,0,3
1,-0.468674,0,2,,946771200000000000,2000,1,2,6,4,2,2,2,0,0,2
2,1.767960,0,0,1,946598400000000000,1999,12,31,4,2,4,1,2,0,0,2
3,-0.118771,1,3,0,946684800000000000,2000,1,1,5,3,3,2,1,0,0,2
4,0.630196,0,1,6,946598400000000000,1999,12,31,4,2,4,1,0,0,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,-1.182318,-1,0,,946684800000000000,2000,1,1,5,1,5,0,3,0,0,1
96,0.562761,0,0,,946684800000000000,2000,1,1,5,5,1,1,1,1,1,4
97,-0.797270,0,1,,946684800000000000,2000,1,1,5,3,3,0,1,0,2,2
98,0.502741,0,3,,946598400000000000,1999,12,31,4,6,0,0,0,2,2,2


We can see that:

* The floating point and integer columns ‘A’ and ‘B’ are unchanged.

* The datetime column ‘C’ has been converted to a raw value (in nanoseconds), as well as parsed into additional columns for the year, month, day and dayofweek.

* The string categorical column ‘D’ has been mapped 1:1 to integers - a lot of models only accept numerical input.

* The freeform text column ‘E’ has been mapped into some summary features (‘char_count’ etc) as well as an N-hot matrix saying whether each text contained each word.
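
The datetime and text transformations listed above can be mimicked with the standard library. This is a rough sketch rather than AutoGluon's code: the helper names are hypothetical, naive datetimes are assumed to be UTC, and the meaning of `_total_` (number of distinct vocabulary words present) is inferred from the sample rows in the output.

```python
from collections import Counter
from datetime import datetime, timezone


def datetime_features(ts):
    """Expand a naive datetime (assumed UTC) into numeric features."""
    epoch_seconds = int(ts.replace(tzinfo=timezone.utc).timestamp())
    return {
        "raw_ns": epoch_seconds * 1_000_000_000,  # nanoseconds since 1970-01-01
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "dayofweek": ts.weekday(),  # Monday=0 ... Sunday=6
    }


def word_count_features(text, vocabulary):
    """Count how often each vocabulary word occurs in a whitespace-split text."""
    counts = Counter(text.split())
    features = {word: counts[word] for word in vocabulary}
    # Assumed meaning: how many distinct vocabulary words appear in the text.
    features["_total_"] = sum(1 for word in vocabulary if counts[word] > 0)
    return features


date_feats = datetime_features(datetime(2000, 1, 1))
text_feats = word_count_features("ghi ef ef abc", vocabulary=["abc", "ef", "ghi", "jkl"])
print(date_feats)
print(text_feats)
```

For the first row of the example data (2000-01-01 and "ghi ef ef abc"), these helpers give the same raw value, date parts, and word counts as the corresponding columns in the transformed output.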

To get more details, we should call the pipeline as part of TabularPredictor.fit(). We need to combine the dfx and dfy DataFrames since fit() expects a single dataframe.

In [36]:
df = pd.concat([dfx, dfy], axis=1)
predictor = TabularPredictor(label='label')
predictor.fit(df, hyperparameters={'GBM' : {}}, feature_generator=auto_ml_pipeline_feature_generator)

No path specified. Models will be saved in: "AutogluonModels/ag-20250621_104741"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.3.1
Python Version:     3.11.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Sun Mar 30 16:01:29 UTC 2025
CPU Count:          2
Memory Avail:       10.13 GB / 12.67 GB (79.9%)
Disk Space Avail:   182.77 GB / 225.83 GB (80.9%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='high'         : Strong 

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7881df96c550>

Reading the output, note that:

* the string-categorical column ‘D’, despite being mapped to integers, is still recognised as categorical.

* the integer column ‘B’ has not been identified as categorical, even though it only has a few unique values:

In [37]:
print(len(set(dfx['B'])))

5


To have it treated as categorical, we can explicitly cast it to the category dtype in the original dataframe:

In [38]:
dfx["B"] = dfx["B"].astype("category")
auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()
auto_ml_pipeline_feature_generator.fit_transform(X=dfx)

Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    10373.31 MB
	Train Data (Original)  Memory Usage: 0.01 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
		Fitting DatetimeFeatureGenerator...
		Fitting TextSpecialFeatureGenerator...
			Fitting BinnedFeatureGenerator...
			Fitting DropDuplicatesFeatureGenerator...
		Fitting TextNgramFeatureGenerator...
			Fitting CountVectorizer for text features: ['E']
			CountVectorizer fit with vocabulary size = 4
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Stage 5 Generators:
		Fitting DropDuplicatesFeatureGenerator...
	Types o

Unnamed: 0,A,B,D,E,C,C.year,C.month,C.day,C.dayofweek,E.char_count,E.symbol_ratio.,__nlp__.abc,__nlp__.ef,__nlp__.ghi,__nlp__.jkl,__nlp__._total_
0,-0.545774,1,3,,946684800000000000,2000,1,1,5,4,2,1,2,1,0,3
1,-0.468674,1,2,,946771200000000000,2000,1,2,6,4,2,2,2,0,0,2
2,1.767960,1,0,1,946598400000000000,1999,12,31,4,2,4,1,2,0,0,2
3,-0.118771,2,3,0,946684800000000000,2000,1,1,5,3,3,2,1,0,0,2
4,0.630196,1,1,6,946598400000000000,1999,12,31,4,2,4,1,0,0,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,-1.182318,0,0,,946684800000000000,2000,1,1,5,1,5,0,3,0,0,1
96,0.562761,1,0,,946684800000000000,2000,1,1,5,5,1,1,1,1,1,4
97,-0.797270,1,1,,946684800000000000,2000,1,1,5,3,3,0,1,0,2,2
98,0.502741,1,3,,946598400000000000,1999,12,31,4,6,0,0,0,2,2,2
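
In the output above, the values 0, 1, and -1 in column ‘B’ now appear as the codes 1, 2, and 0. A sorted-order integer encoding reproduces that pattern, as the plain-Python sketch below shows. The helper `encode_categories` is hypothetical, and the ordering rule is an assumption inferred from the output; AutoGluon's actual logic may differ.

```python
def encode_categories(values):
    """Map each distinct non-missing value to an integer code, in sorted order."""
    categories = sorted({v for v in values if v is not None})
    code_of = {category: code for code, category in enumerate(categories)}
    # Missing values (None) stay missing rather than receiving a code.
    return [code_of.get(v) for v in values], code_of


codes_b, mapping_b = encode_categories([0, 0, 0, 1, 0, -1])
codes_d, mapping_d = encode_categories(["y", "x", "v", "y", "w", None])
print(codes_b, mapping_b)
print(codes_d, mapping_d)
```

The codes are only identifiers: a model that treats the column as categorical should not read any ordering or distance into them.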


## Missing Value Handling
To illustrate missing value handling, let’s set the first row to all NaNs:

In [39]:
dfx.iloc[0] = np.nan
dfx.head()

Unnamed: 0,A,B,C,D,E
0,,,NaT,,
1,-0.468674,0.0,2000-01-02,x,ef abc abc ef
2,1.76796,0.0,1999-12-31,v,abc ef ef d
3,-0.118771,1.0,2000-01-01,y,abc d ef abc
4,0.630196,0.0,1999-12-31,w,jkl d d abc


Now if we reprocess:

In [40]:
auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()
auto_ml_pipeline_feature_generator.fit_transform(X=dfx)

Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    10341.13 MB
	Train Data (Original)  Memory Usage: 0.01 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
		Fitting DatetimeFeatureGenerator...
		Fitting TextSpecialFeatureGenerator...
			Fitting BinnedFeatureGenerator...
			Fitting DropDuplicatesFeatureGenerator...
		Fitting TextNgramFeatureGenerator...
			Fitting CountVectorizer for text features: ['E']
			CountVectorizer fit with vocabulary size = 4
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Stage 5 Generators:
		Fitting DropDuplicatesFeatureGenerator...
	Types o

Unnamed: 0,A,B,D,E,C,C.year,C.month,C.day,C.dayofweek,E.char_count,E.word_count,E.symbol_ratio.,__nlp__.abc,__nlp__.ef,__nlp__.ghi,__nlp__.jkl,__nlp__._total_
0,,,,,946687418181818240,2000,1,1,5,0,0,0,0,0,0,0,0
1,-0.468674,1,2,,946771200000000000,2000,1,2,6,5,1,3,2,2,0,0,2
2,1.767960,1,0,1,946598400000000000,1999,12,31,4,3,1,5,1,2,0,0,2
3,-0.118771,2,3,0,946684800000000000,2000,1,1,5,4,1,4,2,1,0,0,2
4,0.630196,1,1,6,946598400000000000,1999,12,31,4,3,1,5,1,0,0,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,-1.182318,0,0,,946684800000000000,2000,1,1,5,2,1,6,0,3,0,0,1
96,0.562761,1,0,,946684800000000000,2000,1,1,5,6,1,2,1,1,1,1,4
97,-0.797270,1,1,,946684800000000000,2000,1,1,5,4,1,4,0,1,0,2,2
98,0.502741,1,3,,946598400000000000,1999,12,31,4,7,1,1,0,0,2,2,2


We see that the floating point, integer, categorical and text fields ‘A’, ‘B’, ‘D’, and ‘E’ have retained the NaNs, but the datetime column ‘C’ has been set to the mean of the non-NaN values.
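
Mean imputation of a datetime column can be sketched by working on the numeric timestamps. The example below uses only the standard library and made-up dates; the helper names are hypothetical, naive datetimes are assumed to be UTC, and seconds are used instead of nanoseconds for readability.

```python
from datetime import datetime, timezone


def to_epoch_seconds(ts):
    """Convert a naive datetime (assumed UTC) to whole seconds since 1970-01-01."""
    return int(ts.replace(tzinfo=timezone.utc).timestamp())


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean_value = sum(observed) / len(observed)
    return [mean_value if v is None else v for v in values]


dates = [None, datetime(2000, 1, 2), datetime(1999, 12, 31), datetime(2000, 1, 1)]
epoch_seconds = [None if d is None else to_epoch_seconds(d) for d in dates]
filled = fill_missing_with_mean(epoch_seconds)
print(filled)
```

Here the missing first entry is filled with the timestamp for 2000-01-01, which is the mean of the three observed dates.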

## Customization of Feature Engineering
To customize your feature generation pipeline, it is recommended to call PipelineFeatureGenerator, passing in non-default parameters to other feature generators as required. For example, if we think downstream models would benefit from removing rare categorical values and replacing them with NaN, we can supply the parameter maximum_num_cat to CategoryFeatureGenerator, as below:

In [41]:
from autogluon.features.generators import PipelineFeatureGenerator, CategoryFeatureGenerator, IdentityFeatureGenerator
from autogluon.common.features.types import R_INT, R_FLOAT
mypipeline = PipelineFeatureGenerator(
    generators = [[
        CategoryFeatureGenerator(maximum_num_cat=10),  # Overridden from default.
        IdentityFeatureGenerator(infer_features_in_args=dict(valid_raw_types=[R_INT, R_FLOAT])),
    ]]
)

If we then dump out the transformed data, we can see that all columns have been converted to numeric, because that’s what most models require, and the rare categorical values have been replaced with NaN:

In [42]:
mypipeline.fit_transform(X=dfx)

Fitting PipelineFeatureGenerator...
	Available Memory:                    10353.35 MB
	Train Data (Original)  Memory Usage: 0.01 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
		Fitting IdentityFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Stage 5 Generators:
		Fitting DropDuplicatesFeatureGenerator...
	Unused Original Features (Count: 1): ['C']
		These features were not used to generate any of the output features. Add a feature generator compatible with these features to utilize them.
		Features can also be unused if they carry very little information, such as being categorical but having almost entirely uniqu

Unnamed: 0,B,D,E,A
0,,,,
1,1,2,,-0.468674
2,1,0,1,1.767960
3,2,3,0,-0.118771
4,1,1,6,0.630196
...,...,...,...,...
95,0,0,,-1.182318
96,1,0,,0.562761
97,1,1,,-0.797270
98,1,3,,0.502741


For more on custom feature engineering, see the detailed example script:
https://github.com/autogluon/autogluon/blob/master/examples/tabular/example_custom_feature_generator.py