Tabular: Accelerate boolean preprocessing #2944
Conversation
LGTM, a really significant speedup on boolean feature processing!
Benchmark results on AMLB show major speedups in inference throughput for batch size 1 (and also major speedups for batch size 10000). For example, QSAR-TID-11 had a 37.8x speedup. The only dataset that meaningfully became slower was
Below is the batch_size_1 comparison:
Issue #, if available:
Description of changes:
When many boolean features are present, this change massively speeds up preprocessing in both online and batch inference. Note that there is no impact on datasets with fewer than 15 boolean features, as below that threshold the overhead of the pd.concat operation exceeds the benefits of the optimization.
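As a rough illustration of the kind of optimization described above (a hedged sketch, not the actual PR diff — the function name, threshold parameter, and structure are assumptions for illustration): converting boolean columns one at a time repeatedly inserts into the DataFrame, while converting them as a single block and merging with one pd.concat amortizes that cost when many columns are involved.

```python
import numpy as np
import pandas as pd


def convert_boolean_features(df: pd.DataFrame, bool_cols: list,
                             threshold: int = 15) -> pd.DataFrame:
    """Hypothetical sketch of batched boolean conversion.

    Below `threshold` columns, per-column assignment is cheaper;
    above it, one block-wise astype plus a single pd.concat avoids
    repeated per-column insertions into the DataFrame.
    """
    if len(bool_cols) < threshold:
        # Few boolean columns: per-column assignment is fine here.
        for col in bool_cols:
            df[col] = df[col].astype(np.int8)
        return df
    # Many boolean columns: convert them all in one vectorized call,
    # then merge back with the untouched columns in a single concat.
    converted = df[bool_cols].astype(np.int8)
    other = df.drop(columns=bool_cols)
    # Reindex to preserve the original column order.
    return pd.concat([other, converted], axis=1)[df.columns]
```

The batched path performs one conversion and one concatenation regardless of how many boolean columns exist, which is why the benefit grows with the number of boolean features.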
For example, the KDDCup09-Upselling dataset has 15,000 columns, of which over 5,000 are boolean features.

Performance comparison
On KDDCup09-Upselling, fit-time preprocessing is sped up by 5x (445 s -> 72 s), and inference is sped up by 6x (batch_size=1) and 20x (batch_size=10000).

Mainline:
This PR:
TODO:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.