Tabular: Accelerate boolean preprocessing #2944

Innixma · 2023-02-17T05:37:58Z

Issue #, if available:

Description of changes:

Accelerate boolean preprocessing

When many boolean features are present, this change substantially speeds up preprocessing for both online and batch inference. Datasets with fewer than 15 boolean features are unaffected: below that count, the overhead of the pd.concat operation exceeds the benefit of the optimization.
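The diff itself is not reproduced in this conversation, so the following is only a minimal sketch of the general idea, assuming the speedup comes from converting all boolean columns in one array-level pass and reattaching them with a single pd.concat instead of assigning columns one by one. The function names, the `true_values` mapping (column name to the value treated as True), and the `min_bool_features` parameter are hypothetical and not AutoGluon APIs; only the cutoff of 15 is taken from the description above.

```python
import numpy as np
import pandas as pd


def convert_bool_inplace(df, true_values):
    """Baseline: overwrite boolean columns one at a time."""
    out = df.copy()
    for col, true_val in true_values.items():
        out[col] = (out[col] == true_val).astype(np.int8)
    return out


def convert_bool_batched(df, true_values):
    """Convert every boolean column in one array comparison, then concat once."""
    bool_cols = list(true_values)
    # One "true" value per boolean column, broadcast across all rows.
    true_arr = np.array([true_values[c] for c in bool_cols], dtype=object)
    block = (df[bool_cols].to_numpy(dtype=object) == true_arr).astype(np.int8)
    bool_df = pd.DataFrame(block, columns=bool_cols, index=df.index)
    other_cols = [c for c in df.columns if c not in true_values]
    # A single pd.concat replaces many per-column assignments.
    merged = pd.concat([df[other_cols], bool_df], axis=1)
    return merged[list(df.columns)]  # restore the original column order


def convert_bool(df, true_values, min_bool_features=15):
    """Hypothetical dispatcher: skip the batched path for few boolean columns."""
    if len(true_values) < min_bool_features:
        return convert_bool_inplace(df, true_values)
    return convert_bool_batched(df, true_values)
```

Both paths produce the same int8 columns; the dispatcher only decides which one pays off, mirroring the note that the pd.concat overhead is not worth it for small numbers of boolean features.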

For example, the KDDCup09-Upselling dataset has roughly 15,000 columns, of which over 5,000 are detected as boolean features (5,806 in the log below):

	Types of features in original data (exact raw dtype, raw dtype):
		('float64', 'float') :  1287 | ['Var7', 'Var12', 'Var22', 'Var28', 'Var47', ...]
		('int64', 'int')     :  1633 | ['Var11', 'Var21', 'Var49', 'Var78', 'Var82', ...]
		('object', 'object') :   188 | ['Var14741', 'Var14742', 'Var14743', 'Var14745', 'Var14746', ...]
		('uint8', 'int')     : 10338 | ['Var1', 'Var2', 'Var3', 'Var4', 'Var5', ...]
	Types of features in original data (raw dtype, special dtypes):
		('float', [])  :  1287 | ['Var7', 'Var12', 'Var22', 'Var28', 'Var47', ...]
		('int', [])    : 11971 | ['Var1', 'Var2', 'Var3', 'Var4', 'Var5', ...]
		('object', []) :   188 | ['Var14741', 'Var14742', 'Var14743', 'Var14745', 'Var14746', ...]
	Types of features in processed data (exact raw dtype, raw dtype):
		('category', 'category') :  157 | ['Var14742', 'Var14745', 'Var14746', 'Var14747', 'Var14748', ...]
		('float64', 'float')     : 1260 | ['Var7', 'Var12', 'Var22', 'Var28', 'Var47', ...]
		('int64', 'int')         : 1612 | ['Var11', 'Var21', 'Var49', 'Var78', 'Var82', ...]
		('int8', 'int')          : 5806 | ['Var5', 'Var6', 'Var10', 'Var14', 'Var16', ...]
		('uint8', 'int')         : 4611 | ['Var1', 'Var2', 'Var3', 'Var4', 'Var8', ...]
	Types of features in processed data (raw dtype, special dtypes):
		('category', [])  :  157 | ['Var14742', 'Var14745', 'Var14746', 'Var14747', 'Var14748', ...]
		('float', [])     : 1260 | ['Var7', 'Var12', 'Var22', 'Var28', 'Var47', ...]
		('int', [])       : 6223 | ['Var1', 'Var2', 'Var3', 'Var4', 'Var8', ...]
		('int', ['bool']) : 5806 | ['Var5', 'Var6', 'Var10', 'Var14', 'Var16', ...]

Performance comparison

On KDDCup09-Upselling, fit-time preprocessing is about 6x faster (445 s -> 72 s), and inference preprocessing is about 6x faster at batch_size=1 and about 17x faster at batch_size=10000.
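Per-row timings of this kind can be gathered with a small harness along the following lines. This is a sketch rather than the script used for this PR: `per_row_seconds` is a hypothetical helper, and `transform` stands in for whatever preprocessing call is being measured (transform_features in the logs).

```python
import time


def per_row_seconds(transform, rows, batch_size, repeats=3):
    """Return the best observed wall-clock time per row for one batch size."""
    batch = rows[:batch_size]
    if len(batch) == 0:
        raise ValueError("need at least one row to time")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        transform(batch)
        best = min(best, time.perf_counter() - start)
    return best / len(batch)


if __name__ == "__main__":
    data = list(range(10_000))
    for size in (1, 10, 100, 1000, 10000):
        cost = per_row_seconds(lambda b: [x * 2 for x in b], data, size)
        print(f"Throughput for batch_size={size}: {cost:.3e}s per row")
```

Taking the minimum over a few repeats reduces noise from warm-up effects; fixed per-call overhead is why the per-row cost typically falls as the batch size grows.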

Mainline:

Data preprocessing and feature engineering runtime = 445.02s ...
Throughput for batch_size=1:
	2.359s per row | transform_features
Throughput for batch_size=10:
	0.248s per row | transform_features
Throughput for batch_size=100:
	0.029s per row | transform_features
Throughput for batch_size=1000:
	6.201ms per row | transform_features
Throughput for batch_size=10000:
	4.24ms per row | transform_features

This PR:

Data preprocessing and feature engineering runtime = 71.93s ...
Throughput for batch_size=1:
	0.4s per row | transform_features
Throughput for batch_size=10:
	0.045s per row | transform_features
Throughput for batch_size=100:
	0.011s per row | transform_features
Throughput for batch_size=1000:
	1.404ms per row | transform_features
Throughput for batch_size=10000:
	0.244ms per row | transform_features
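The speedup ratios implied by the two logs can be recomputed directly. The snippet below only copies the logged numbers (converted to seconds) and divides them; the dictionary names are arbitrary.

```python
# Values copied from the "Mainline" and "This PR" logs, in seconds.
mainline = {"fit": 445.02, 1: 2.359, 10: 0.248, 100: 0.029,
            1000: 6.201e-3, 10000: 4.24e-3}
this_pr = {"fit": 71.93, 1: 0.4, 10: 0.045, 100: 0.011,
           1000: 1.404e-3, 10000: 0.244e-3}

# Ratio > 1 means this PR is faster than mainline.
speedup = {key: mainline[key] / this_pr[key] for key in mainline}

for key, ratio in speedup.items():
    label = "fit preprocessing" if key == "fit" else f"batch_size={key}"
    print(f"{label}: {ratio:.1f}x")
```

With the logged values this gives roughly 6.2x for fit preprocessing, 5.9x at batch_size=1, and 17.4x at batch_size=10000; the intermediate batch sizes land between 2.6x and 5.5x.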

TODO:

Benchmark on AMLB

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

github-actions · 2023-02-17T20:34:03Z

Job PR-2944-f400930 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2944/f400930/index.html

liangfu

LGTM, really significant speed up on boolean feature processing!

github-actions · 2023-02-20T23:01:55Z

Job PR-2944-1039341 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2944/1039341/index.html

Innixma · 2023-02-21T02:03:02Z

AMLB benchmark results show major speedups in inference throughput at batch size 1 (and also at batch size 10000).

For example, QSAR-TID-11 had a 37.9x speedup.

The only dataset that became meaningfully slower was arcene, by about 3.6x. arcene has 10,000 features overall, of which 100 are boolean. Even with 100 boolean features, they make up only 1% of the total, so the previous in-place edits were faster than the new approach used in this PR. We could probably make changes to avoid this slowdown, but it is clear that this PR speeds things up in general, and we can revisit arcene in the future if we think it is useful.
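One way such a slowdown might be avoided is to also gate the new path on the share of boolean features, not only on their count. The guard below is purely illustrative and not part of this PR: the function name is hypothetical, `min_bool=15` echoes the cutoff mentioned in the description, and `min_bool_fraction=0.05` is a placeholder that would need to be tuned on benchmarks.

```python
def use_batched_bool_path(n_bool, n_total, min_bool=15, min_bool_fraction=0.05):
    """Decide whether the batched boolean conversion is likely to pay off.

    Both thresholds are assumptions for illustration, not measured values.
    """
    if n_total <= 0:
        return False
    return n_bool >= min_bool and n_bool / n_total >= min_bool_fraction


# arcene-like shape: 100 boolean features out of 10,000 (1%).
print(use_batched_bool_path(100, 10_000))    # False: fall back to in-place edits
# KDDCup09-Upselling-like shape: 5,806 boolean features out of 13,446.
print(use_batched_bool_path(5_806, 13_446))  # True: use the batched path
```

A fraction-based guard keeps the fast path for wide, boolean-heavy tables while leaving datasets like arcene on the previous code path.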

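Below is the batch_size_1 comparison. Judging by the QSAR-TID-11 and arcene rows, each row appears to name the slower of the two frameworks and how many times slower it was: rows naming 2023_02_14_v07 favor this PR, and rows naming 2023_02_20_bool favor the earlier run.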

dataset	framework	time_infer_s_rescaled
QSAR-TID-11	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	37.8825899
QSAR-TID-10980	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	37.4026658
cnae-9	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	22.7168149
amazon-commerce-reviews	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	16.4769365
Bioresponse	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	14.9046798
Mercedes_Benz_Greener_Manufacturing	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	8.25304093
Internet-Advertisements	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	7.70006037
KDDCup09-Upselling	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	5.27209839
dna	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	4.71723012
jasmine	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	3.81948861
arcene	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	3.57990337
ada	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	3.48015793
micro-mass	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.99916776
yprop_4_1	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.97469511
Allstate_Claims_Severity	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.40451882
Airlines_DepDelay_10M	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.34000561
eucalyptus	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.25881412
colleges	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.15041293
house_prices_nominal	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.12081512
helena	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.1008263
black_friday	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.08037571
yeast	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.07145099
SAT11-HAND-runtime-regression	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.07120994
Higgs	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.06378258
MiniBooNE	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.06123804
jannis	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.05948484
phoneme	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.05699168
blood-transfusion-service-center	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.04814805
Click_prediction_small	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.0454627
gina	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.04266587
madeline	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.04172659
okcupid-stem	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.04092817
APSFailure	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.03990475
volkert	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.03962174
diamonds	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.038902
topo_2_1	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.03868574
space_ga	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.03800914
segment	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.03656174
Buzzinsocialmedia_Twitter	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.0365023
Fashion-MNIST	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.03368316
first-order-theorem-proving	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.03262884
wilt	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.03260609
ozone-level-8hr	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.03185221
Satellite	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.03153642
house_16H	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.02623067
dilbert	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.02551334
sf-police-incidents	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.02458316
cmc	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.02433866
qsar-biodeg	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.02417547
tecator	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.02387866
riccardo	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.02380016
bank-marketing	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.02242139
dionis	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.02171851
sylvine	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.02135184
philippine	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.02076368
GesturePhaseSegmentationProcessed	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.01713407
credit-g	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.01695774
KDDCup09_appetency	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.01593346
KDDCup99	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.01420072
us_crime	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.01348013
adult	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.01313499
airlines	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.01313277
churn	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.01295884
robert	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.01271288
porto-seguro	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.01266177
wine_quality	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.01204569
quake	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.01204383
kick	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.01196944
Diabetes130US	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.01194746
connect-4	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.01173541
christine	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.01168272
vehicle	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.01147931
elevators	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.01114222
covertype	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.01063218
pc4	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.01047659
shuttle	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.00962843
Santander_transaction_value	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.00894664
nyc-taxi-green-dec-2016	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.00838114
Amazon_employee_access	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.00833861
sensory	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.00734658
numerai28.6	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.0062699
pol	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.00570136
Australian	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.00553865
guillermo	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.00404336
albert	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.00287897
house_sales	AutoGluon_mq_1h8c_2023_02_20_bool_LightGBM	1.00173352
mfeat-factors	AutoGluon_mq_1h8c_2023_02_14_v07_LightGBM	1.00019518

Innixma added the enhancement and module: tabular labels Feb 17, 2023

Innixma added this to the 0.7 Fast-Follow Items milestone Feb 17, 2023

Innixma requested review from gradientsky, tonyhoo and liangfu February 17, 2023 05:40

Innixma force-pushed the accel_preprocess_bool branch from 86db239 to f400930 February 17, 2023 18:50

liangfu approved these changes Feb 17, 2023

Innixma and others added 6 commits February 20, 2023 13:15

Cleanup save/load code

be81b34

Accelerate boolean preprocessing

63551c2

Add unit tests

7c242ba

Revert "Cleanup save/load code"

26fb324

This reverts commit 522166c.

Update row threshold to 128

9ed7384

fix doc

1039341

Innixma force-pushed the accel_preprocess_bool branch from f400930 to 1039341 February 20, 2023 21:16

Innixma modified the milestones: 0.7 Fast-Follow Items, 0.7.1 Release Feb 21, 2023

Innixma merged commit 6bcc838 into autogluon:master Feb 21, 2023

Innixma mentioned this pull request Feb 21, 2023

New AutoGluon Webpage #2924

Merged

4 tasks

Innixma mentioned this pull request Mar 1, 2023

Fix AsTypeFeatureGenerator Edge-case Crash #2986

Merged

Innixma modified the milestones: 0.7.1 Release, 0.8 Release May 16, 2023
