
[Tabular] Convert ColumnTransformer in tabular NN from sklearn to onnx #2503

Merged
merged 5 commits into from
Dec 20, 2022

Conversation

@liangfu (Collaborator) commented Dec 1, 2022

Issue #, if available:

Description of changes:

This PR converts the ColumnTransformer in the tabular NN model from sklearn to onnx, so that the processor in TabularNeuralNetTorchModel can be accelerated via onnxruntime.

While the impact on accuracy is minimal,

NeuralNetTorch:
{'accuracy': 0.8365236974101751, 'balanced_accuracy': 0.7532995553694549, 'mcc': 0.5305019485882506, 'roc_auc': 0.8815480292353528, 'f1': 0.6332950631458094, 'precision': 0.6769759450171822, 'recall': 0.594909404659189}
NeuralNetTorch_ONNX:
{'accuracy': 0.8362166035418159, 'balanced_accuracy': 0.7523552495805498, 'mcc': 0.5291951976581161, 'roc_auc': 0.8813204856717612, 'f1': 0.6320147194112237, 'precision': 0.6768472906403941, 'recall': 0.5927523727351165}

the impact on overall online inference speed is significant.

The relative speedups compared to the sklearn baseline are shown in the following figure.

[Figure: rows-per-second inference throughput by batch size, comparing NN_TORCH against its ONNX-compiled variant]

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@liangfu (Collaborator, Author) commented Dec 1, 2022

Here is the script to demonstrate the acceleration.

import os, shutil
import numpy as np
import pandas as pd
import autogluon.core as ag
from autogluon.tabular import TabularDataset, TabularPredictor
from autogluon.core.utils.infer_utils import get_model_true_infer_speed_per_row_batch

def main():

    # Training time:
    train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')  # can be local CSV file as well, returns Pandas DataFrame
    test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')  # another Pandas DataFrame
    label = 'class'  # specifies which column we want to predict
    save_path = 'ag_hpo_models/'  # where to save trained models
    compiled_path = 'ag_hpo_models_compiled/'

    subsample_size = 1000  # subsample subset of data for faster demo, try setting this to much larger values
    if subsample_size is not None and subsample_size < len(train_data):
        train_data = train_data.sample(n=subsample_size, random_state=0)

    hyperparameters = {
        'NN_TORCH': {},
    }

    predictor = TabularPredictor(label=label, path=save_path).fit(
        train_data, hyperparameters=hyperparameters,
        fit_weighted_ensemble=False,
    )
    predictor.save()

    repeats = 10
    batch_sizes = [
        1,
        10,
        100,
        1000,
        10000,
        100000,
    ]

    infer_dfs = dict()
    infer_transform_dfs = dict()
    silent = False
    test_data = test_data.head(10)

    for batch_size in batch_sizes:
        predictor = TabularPredictor.load(path=save_path)
        if os.path.exists(compiled_path):
            shutil.rmtree(compiled_path)
        predictor.clone(path=compiled_path)

        predictor = TabularPredictor.load(path=compiled_path)

        infer_df, time_per_row_transform = get_model_true_infer_speed_per_row_batch(
            data=test_data,
            predictor=predictor,
            batch_size=batch_size,
            repeats=repeats,
            persist_models=True,
            silent=silent)

        predictor = TabularPredictor.load(path=compiled_path)
        predictor.compile_models(models='all', compiler_configs={
            "NN_TORCH": {"compiler": 'onnx', 'batch_size': min(4096, batch_size)},
            "RF": {"compiler": 'onnx'},
        })

        infer_compiled_df, _ = get_model_true_infer_speed_per_row_batch(
            data=test_data,
            predictor=predictor,
            batch_size=batch_size,
            repeats=repeats,
            persist_models=True,
            silent=silent)

        infer_df = infer_df.reset_index()
        infer_compiled_df = infer_compiled_df.reset_index()
        # rename compiled models in one vectorized step (avoids pandas chained assignment)
        infer_compiled_df['model'] = infer_compiled_df['model'] + '_ONNX'
        infer_df = pd.concat([infer_df, infer_compiled_df])
        infer_dfs[batch_size] = infer_df
        infer_transform_dfs[batch_size] = time_per_row_transform

    for key in infer_dfs.keys():
        infer_dfs[key] = infer_dfs[key].reset_index()
        infer_dfs[key]['batch_size'] = key

    infer_df_full_transform = pd.Series(infer_transform_dfs, name='pred_time_test').to_frame().rename_axis('batch_size')
    infer_df_full_transform['pred_time_test_marginal'] = infer_df_full_transform['pred_time_test']
    infer_df_full_transform['pred_time_test_with_transform'] = infer_df_full_transform['pred_time_test']
    infer_df_full_transform = infer_df_full_transform.reset_index()

    infer_df_full = pd.concat([infer_dfs[key] for key in infer_dfs.keys()])

    infer_df_full_transform_include = infer_df_full_transform.copy()
    infer_df_full_transform_include['model'] = 'transform_features'
    infer_df_full = pd.concat([infer_df_full, infer_df_full_transform_include])

    infer_df_full = infer_df_full.sort_values(by=['batch_size'])
    infer_df_full = infer_df_full.reset_index(drop=True)

    print(infer_df_full)

    infer_df_full['rows_per_second'] = 1 / infer_df_full['pred_time_test']

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    fig.set_size_inches(12, 12)
    fig.set_dpi(300)

    plt.xscale('log')
    plt.yscale('log')

    models = list(infer_df_full['model'].unique())
    batch_sizes = list(infer_df_full['batch_size'].unique())
    for model in models:
        infer_df_model = infer_df_full[infer_df_full['model'] == model]
        ax.plot(infer_df_model['batch_size'].values, infer_df_model['rows_per_second'].values, label=model)

    ax.set(xlabel='batch_size', ylabel='rows_per_second',
           title='Rows per second inference throughput by data batch_size (AdultIncome)')
    ax.grid()
    ax.legend()
    fig.savefig('infer_speed.png', dpi=300)
    plt.show()

if __name__ == '__main__':
    main()

@github-actions bot commented Dec 3, 2022

Job PR-2503-b00289b is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2503/b00289b/index.html

@github-actions bot commented Dec 5, 2022

Job PR-2503-3a80195 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2503/3a80195/index.html

@Innixma (Contributor) commented Dec 5, 2022

Why test_data = test_data.head(10) in script?

@liangfu (Collaborator, Author) commented Dec 5, 2022

> Why test_data = test_data.head(10) in script?

Nice catch, we can safely remove this line now. It was used to test on a subset of the dataset that doesn't contain NaN.

@github-actions bot commented Dec 5, 2022

Job PR-2503-1572b1b is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2503/1572b1b/index.html

@github-actions bot commented Dec 5, 2022

Job PR-2503-ece6d3a is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2503/ece6d3a/index.html

@liangfu liangfu force-pushed the column-transformer-1 branch 4 times, most recently from a0e1158 to ce0e147 Compare December 14, 2022 23:22
@github-actions

Job PR-2503-ce0e147 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2503/ce0e147/index.html

@Innixma (Contributor) left a comment

LGTM, I can tell a lot of effort went into the specialized onnx compiler functions. Very nice speedup!

I was able to reproduce on my local machine with the provided script.

@github-actions

Job PR-2503-713b4cf is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2503/713b4cf/index.html

@liangfu liangfu merged commit 3f0e36d into autogluon:master Dec 20, 2022
@liangfu liangfu deleted the column-transformer-1 branch December 20, 2022 00:28
3 participants