[Post 0.6][Tabular] make tabular nn dataset iterable #2395
Conversation
Force-pushed eedbfa6 to 8f85dcb — Job PR-2395-8f85dcb is done.
Looks very good! I am able to reproduce the speedups on multiple datasets, nice work!!
Resolved review threads:
- tabular/src/autogluon/tabular/models/tabular_nn/torch/torch_network_modules.py (outdated)
- tabular/src/autogluon/tabular/models/tabular_nn/torch/tabular_torch_dataset.py
Force-pushed 8f85dcb to 2341393 — Job PR-2395-2341393 is done.
# Drop last batch
if self.drop_last and (idx_start + self.batch_size) > self.num_examples:
    break
Were we dropping last before this PR?
This depends on the drop_last argument in the data loader.
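For context, a minimal sketch (not the actual AutoGluon code) of what the `drop_last` argument in a PyTorch-style data loader means for the batch count:

```python
def num_batches(num_examples, batch_size, drop_last):
    """Batch-count semantics matching torch.utils.data.DataLoader:
    drop_last=True discards the final incomplete batch."""
    if drop_last:
        return num_examples // batch_size
    return -(-num_examples // batch_size)  # ceiling division

print(num_batches(9769, 512, drop_last=True))   # 19
print(num_batches(9769, 512, drop_last=False))  # 20
```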
I think I found the answer: we were, if you look at the deleted code lines.
We don't drop_last or shuffle when is_test==True; for training, we were dropping the last batch and shuffling the dataset. That behavior is maintained for consistency.
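A minimal sketch of that behavior (the class name `BatchedTabularDataset` is hypothetical, not the actual AutoGluon implementation, and in practice the class would subclass `torch.utils.data.IterableDataset`): shuffling and `drop_last` are enabled only when `is_test` is False:

```python
import numpy as np

class BatchedTabularDataset:
    """Sketch of an iterable dataset that yields whole batches.
    Shuffles and drops the last incomplete batch only for training
    (is_test=False), mirroring the behavior described above."""

    def __init__(self, data, batch_size, is_test):
        self.data = np.asarray(data)
        self.batch_size = batch_size
        self.num_examples = len(self.data)
        self.shuffle = not is_test
        self.drop_last = not is_test

    def __iter__(self):
        order = np.arange(self.num_examples)
        if self.shuffle:
            np.random.shuffle(order)
        for idx_start in range(0, self.num_examples, self.batch_size):
            # Drop last batch
            if self.drop_last and (idx_start + self.batch_size) > self.num_examples:
                break
            yield self.data[order[idx_start:idx_start + self.batch_size]]
```

With 10 rows and batch_size=4, training yields 2 full batches, while test yields 3 batches (the last one of size 2).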
Description of changes:
This PR converts the map-based tabular NN dataset class into an iterable dataset, which makes batch-based data loading significantly faster.
For instance, loading the 9769 rows in test.csv has been reduced from 49 ms to 9 ms.
Data loading time was measured with the following code snippet.
Under Linux, we achieve about 40% overall time savings for batch_size > 10000.
See the following chart for more benchmark results.
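The measurement snippet and chart are not reproduced here. As an illustration of why the conversion helps (a sketch, not the PR's actual benchmark code): a map-style dataset fetches rows one at a time and then collates them, while an iterable dataset can slice out a whole batch in a single vectorized operation:

```python
import time
import numpy as np

def time_map_style(data, batch_size):
    # Map-style access: fetch rows individually, then stack (per-row overhead).
    t0 = time.perf_counter()
    for start in range(0, len(data), batch_size):
        end = min(start + batch_size, len(data))
        batch = np.stack([data[i] for i in range(start, end)])
    return time.perf_counter() - t0

def time_iterable_style(data, batch_size):
    # Iterable-style access: slice a whole batch at once (vectorized).
    t0 = time.perf_counter()
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
    return time.perf_counter() - t0

data = np.random.rand(9769, 32)  # same row count as the test.csv example above
print(time_map_style(data, 512), time_iterable_style(data, 512))
```

On typical hardware the batched slicing path is faster by a large margin, which is the effect the PR exploits.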
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.