Tabular: Accelerate boolean preprocessing #2944

Merged: 6 commits, Feb 21, 2023

@Innixma Innixma commented Feb 17, 2023

Issue #, if available:

Description of changes:

  • Accelerate boolean preprocessing

In particular, when many boolean features are present, this logic massively speeds up preprocessing in both online and batch inference. Datasets with fewer than 15 boolean features are unaffected: below that count, the overhead of the pd.concat operation exceeds the benefit of the optimization.
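
To illustrate the idea, here is a minimal sketch; it is not the code from this PR. It assumes a boolean feature is a column with one known "true" value that gets mapped to int8 0/1. The function names, the `true_values` mapping, and the demo data are hypothetical; only the 15-feature threshold and the use of pd.concat come from the description above.

```python
import numpy as np
import pandas as pd

BOOL_THRESHOLD = 15  # from the PR description; everything else here is illustrative


def convert_bool_per_column(df, true_values):
    """Baseline: convert each boolean column one at a time (hypothetical helper)."""
    out = df.copy()
    for col, true_val in true_values.items():
        out[col] = (out[col] == true_val).astype(np.int8)
    return out


def convert_bool_bulk(df, true_values):
    """Bulk path: one vectorized comparison, then a single pd.concat (hypothetical helper)."""
    bool_cols = list(true_values)
    if len(bool_cols) < BOOL_THRESHOLD:
        # Too few boolean columns for the concat overhead to pay off.
        return convert_bool_per_column(df, true_values)
    values = df[bool_cols].to_numpy()                       # shape (n_rows, n_bool)
    true_row = np.array([true_values[c] for c in bool_cols])  # shape (n_bool,)
    converted = pd.DataFrame(
        (values == true_row).astype(np.int8),  # broadcasts true_row across rows
        columns=bool_cols,
        index=df.index,
    )
    rest = df.drop(columns=bool_cols)
    # Reassemble once and restore the original column order.
    return pd.concat([rest, converted], axis=1)[list(df.columns)]


# Demo data: 20 uint8 boolean columns plus one float column.
rng = np.random.default_rng(0)
n_rows, n_bool = 8, 20
data = {f"b{i}": rng.integers(0, 2, n_rows).astype(np.uint8) for i in range(n_bool)}
data["x"] = rng.normal(size=n_rows)
df = pd.DataFrame(data)
true_values = {f"b{i}": 1 for i in range(n_bool)}

bulk = convert_bool_bulk(df, true_values)
baseline = convert_bool_per_column(df, true_values)
```

Both paths should produce the same frame; the sketch makes no claim about which is faster on a given dataset.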

For example, the KDDCup09-Upselling dataset has 15,000 columns, of which over 5,000 are boolean features (5,806 after preprocessing, per the ('int', ['bool']) row below):

	Types of features in original data (exact raw dtype, raw dtype):
		('float64', 'float') :  1287 | ['Var7', 'Var12', 'Var22', 'Var28', 'Var47', ...]
		('int64', 'int')     :  1633 | ['Var11', 'Var21', 'Var49', 'Var78', 'Var82', ...]
		('object', 'object') :   188 | ['Var14741', 'Var14742', 'Var14743', 'Var14745', 'Var14746', ...]
		('uint8', 'int')     : 10338 | ['Var1', 'Var2', 'Var3', 'Var4', 'Var5', ...]
	Types of features in original data (raw dtype, special dtypes):
		('float', [])  :  1287 | ['Var7', 'Var12', 'Var22', 'Var28', 'Var47', ...]
		('int', [])    : 11971 | ['Var1', 'Var2', 'Var3', 'Var4', 'Var5', ...]
		('object', []) :   188 | ['Var14741', 'Var14742', 'Var14743', 'Var14745', 'Var14746', ...]
	Types of features in processed data (exact raw dtype, raw dtype):
		('category', 'category') :  157 | ['Var14742', 'Var14745', 'Var14746', 'Var14747', 'Var14748', ...]
		('float64', 'float')     : 1260 | ['Var7', 'Var12', 'Var22', 'Var28', 'Var47', ...]
		('int64', 'int')         : 1612 | ['Var11', 'Var21', 'Var49', 'Var78', 'Var82', ...]
		('int8', 'int')          : 5806 | ['Var5', 'Var6', 'Var10', 'Var14', 'Var16', ...]
		('uint8', 'int')         : 4611 | ['Var1', 'Var2', 'Var3', 'Var4', 'Var8', ...]
	Types of features in processed data (raw dtype, special dtypes):
		('category', [])  :  157 | ['Var14742', 'Var14745', 'Var14746', 'Var14747', 'Var14748', ...]
		('float', [])     : 1260 | ['Var7', 'Var12', 'Var22', 'Var28', 'Var47', ...]
		('int', [])       : 6223 | ['Var1', 'Var2', 'Var3', 'Var4', 'Var8', ...]
		('int', ['bool']) : 5806 | ['Var5', 'Var6', 'Var10', 'Var14', 'Var16', ...]

Performance comparison

On KDDCup09-Upselling, fit-time preprocessing is sped up by about 6x (445 s -> 72 s), and inference is sped up by about 6x at batch_size=1 and about 17x at batch_size=10000.

Mainline:

Data preprocessing and feature engineering runtime = 445.02s ...
Throughput for batch_size=1:
	2.359s per row | transform_features
Throughput for batch_size=10:
	0.248s per row | transform_features
Throughput for batch_size=100:
	0.029s per row | transform_features
Throughput for batch_size=1000:
	6.201ms per row | transform_features
Throughput for batch_size=10000:
	4.24ms per row | transform_features

This PR:

Data preprocessing and feature engineering runtime = 71.93s ...
Throughput for batch_size=1:
	0.4s per row | transform_features
Throughput for batch_size=10:
	0.045s per row | transform_features
Throughput for batch_size=100:
	0.011s per row | transform_features
Throughput for batch_size=1000:
	1.404ms per row | transform_features
Throughput for batch_size=10000:
	0.244ms per row | transform_features
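
As a quick check of the ratios, the speedups can be recomputed from the two logs above. The snippet below only restates the logged numbers (converted to seconds); the variable names are illustrative.

```python
# Per-row transform_features latency in seconds, copied from the two logs above.
mainline = {1: 2.359, 10: 0.248, 100: 0.029, 1000: 6.201e-3, 10000: 4.24e-3}
this_pr = {1: 0.4, 10: 0.045, 100: 0.011, 1000: 1.404e-3, 10000: 0.244e-3}

# Speedup = mainline latency / PR latency, per batch size.
speedups = {bs: mainline[bs] / this_pr[bs] for bs in mainline}

# Fit-time preprocessing runtime in seconds, also from the logs.
fit_speedup = 445.02 / 71.93

for bs, s in speedups.items():
    print(f"batch_size={bs:>5}: {s:.1f}x")
print(f"fit preprocessing: {fit_speedup:.1f}x")
```

The gain is smallest at batch_size=100 and largest at batch_size=10000.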

TODO:

  • Benchmark on AMLB

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@Innixma Innixma added the enhancement and module: tabular labels Feb 17, 2023
@Innixma Innixma added this to the 0.7 Fast-Follow Items milestone Feb 17, 2023
@github-actions

Job PR-2944-f400930 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2944/f400930/index.html

@liangfu liangfu left a comment

LGTM, really significant speed up on boolean feature processing!

@github-actions

Job PR-2944-1039341 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2944/1039341/index.html

@Innixma Innixma modified the milestones: 0.7 Fast-Follow Items, 0.7.1 Release Feb 21, 2023
@Innixma Innixma commented Feb 21, 2023

Benchmark results on AMLB show major speedups in inference throughput at batch size 1 (and also at batch size 10000).

For example, QSAR-TID-11 had a 37.9x speedup.

The only dataset that became meaningfully slower was arcene, by about 3.6x. arcene has 10,000 features overall, of which only 100 are boolean. The boolean features therefore make up just 1% of the total, and for such a small share the previous in-place edits were faster than the new approach used in this PR. We could probably adjust the logic to avoid this slowdown, but it is clear that this PR speeds things up in general, so we can revisit arcene in the future if we think it is useful.
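
The arcene case suggests that the share of boolean features may matter, not only their count. Below is a purely hypothetical sketch of such a check. The function name `use_bulk_bool_path` and the `min_fraction` cutoff of 0.05 are assumptions made up for illustration and are not part of this PR; only the 15-feature threshold and the feature counts come from the text and logs above.

```python
def use_bulk_bool_path(n_bool, n_total, min_bool=15, min_fraction=0.05):
    """Hypothetical heuristic: take the bulk path only when boolean features
    are both numerous enough and a large enough share of all features.

    min_bool is the threshold from the PR description; min_fraction is an
    invented illustrative value, not a tuned or measured one.
    """
    if n_total <= 0:
        return False
    return n_bool >= min_bool and (n_bool / n_total) >= min_fraction


# arcene: 100 boolean features out of 10,000 (1% of the total).
arcene_bulk = use_bulk_bool_path(100, 10_000)

# KDDCup09-Upselling: 5,806 boolean features out of 13,446 processed features.
kdd_bulk = use_bulk_bool_path(5_806, 157 + 1_260 + 6_223 + 5_806)

print(arcene_bulk, kdd_bulk)
```

Any real cutoff would need to be chosen from benchmarks rather than picked by hand as it is here.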

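Below is the batch_size_1 comparison. The framework column appears to name the slower of the two runs (`..._v07_...` being mainline and `..._bool_...` this PR), with time_infer_s_rescaled giving its inference time relative to the faster run: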

| dataset | framework | time_infer_s_rescaled |
| --- | --- | --- |
| QSAR-TID-11 | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 37.8825899 |
| QSAR-TID-10980 | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 37.4026658 |
| cnae-9 | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 22.7168149 |
| amazon-commerce-reviews | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 16.4769365 |
| Bioresponse | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 14.9046798 |
| Mercedes_Benz_Greener_Manufacturing | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 8.25304093 |
| Internet-Advertisements | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 7.70006037 |
| KDDCup09-Upselling | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 5.27209839 |
| dna | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 4.71723012 |
| jasmine | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 3.81948861 |
| arcene | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 3.57990337 |
| ada | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 3.48015793 |
| micro-mass | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.99916776 |
| yprop_4_1 | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.97469511 |
| Allstate_Claims_Severity | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.40451882 |
| Airlines_DepDelay_10M | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.34000561 |
| eucalyptus | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.25881412 |
| colleges | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.15041293 |
| house_prices_nominal | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.12081512 |
| helena | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.1008263 |
| black_friday | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.08037571 |
| yeast | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.07145099 |
| SAT11-HAND-runtime-regression | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.07120994 |
| Higgs | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.06378258 |
| MiniBooNE | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.06123804 |
| jannis | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.05948484 |
| phoneme | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.05699168 |
| blood-transfusion-service-center | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.04814805 |
| Click_prediction_small | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.0454627 |
| gina | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.04266587 |
| madeline | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.04172659 |
| okcupid-stem | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.04092817 |
| APSFailure | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.03990475 |
| volkert | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.03962174 |
| diamonds | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.038902 |
| topo_2_1 | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.03868574 |
| space_ga | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.03800914 |
| segment | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.03656174 |
| Buzzinsocialmedia_Twitter | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.0365023 |
| Fashion-MNIST | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.03368316 |
| first-order-theorem-proving | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.03262884 |
| wilt | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.03260609 |
| ozone-level-8hr | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.03185221 |
| Satellite | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.03153642 |
| house_16H | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.02623067 |
| dilbert | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.02551334 |
| sf-police-incidents | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.02458316 |
| cmc | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.02433866 |
| qsar-biodeg | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.02417547 |
| tecator | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.02387866 |
| riccardo | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.02380016 |
| bank-marketing | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.02242139 |
| dionis | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.02171851 |
| sylvine | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.02135184 |
| philippine | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.02076368 |
| GesturePhaseSegmentationProcessed | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.01713407 |
| credit-g | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.01695774 |
| KDDCup09_appetency | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.01593346 |
| KDDCup99 | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.01420072 |
| us_crime | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.01348013 |
| adult | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.01313499 |
| airlines | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.01313277 |
| churn | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.01295884 |
| robert | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.01271288 |
| porto-seguro | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.01266177 |
| wine_quality | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.01204569 |
| quake | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.01204383 |
| kick | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.01196944 |
| Diabetes130US | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.01194746 |
| connect-4 | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.01173541 |
| christine | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.01168272 |
| vehicle | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.01147931 |
| elevators | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.01114222 |
| covertype | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.01063218 |
| pc4 | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.01047659 |
| shuttle | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.00962843 |
| Santander_transaction_value | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.00894664 |
| nyc-taxi-green-dec-2016 | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.00838114 |
| Amazon_employee_access | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.00833861 |
| sensory | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.00734658 |
| numerai28.6 | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.0062699 |
| pol | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.00570136 |
| Australian | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.00553865 |
| guillermo | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.00404336 |
| albert | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.00287897 |
| house_sales | AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM | 1.00173352 |
| mfeat-factors | AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM | 1.00019518 |
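
For readers who want to slice the comparison themselves, here is a small sketch that parses a four-row excerpt of the table above with pandas. It assumes the framework column names the slower of the two runs, which is consistent with arcene being listed under the `_bool_` run. The `slower_run` column and the 1.5x cutoff for "meaningful" differences are illustrative choices, not something stated in this PR.

```python
from io import StringIO

import pandas as pd

# Four rows copied verbatim from the comparison above (space-separated).
excerpt = """dataset framework time_infer_s_rescaled
QSAR-TID-11 AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM 37.8825899
KDDCup09-Upselling AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM 5.27209839
arcene AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM 3.57990337
Airlines_DepDelay_10M AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM 1.34000561
"""

results = pd.read_csv(StringIO(excerpt), sep=" ")

# Assumption: the listed framework is the slower run of the pair.
is_pr_run = results["framework"].str.contains("_bool_", regex=False)
results["slower_run"] = is_pr_run.map({True: "this PR", False: "mainline"})

# Illustrative cutoff: only keep differences of at least 1.5x.
meaningful = results[results["time_infer_s_rescaled"] >= 1.5]
print(meaningful[["dataset", "slower_run", "time_infer_s_rescaled"]])
```

Under that reading, arcene is the only row in this excerpt where this PR is the slower run by more than the cutoff.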

@Innixma Innixma merged commit 6bcc838 into autogluon:master Feb 21, 2023
@Innixma Innixma mentioned this pull request Feb 21, 2023
4 tasks
@Innixma Innixma modified the milestones: 0.7.1 Release, 0.8 Release May 16, 2023