
[Post 0.6][Tabular] make tabular nn dataset iterable #2395

Merged (2 commits) on Nov 23, 2022

Conversation

@liangfu (Collaborator) commented Nov 15, 2022

Description of changes:
This PR converts the map-based tabular NN dataset class into an iterable dataset, which makes batch-based data loading significantly faster.
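
As a rough illustration (a minimal, hypothetical sketch with made-up names, not the actual AutoGluon code): instead of returning one row per __getitem__ call, the dataset slices whole batches inside __iter__, so the DataLoader no longer pays per-row indexing and collation overhead.

    # Hypothetical sketch of an iterable, batch-yielding tabular dataset;
    # class and variable names are illustrative only.
    import numpy as np
    import torch
    from torch.utils.data import DataLoader, IterableDataset

    class BatchedTabularDataset(IterableDataset):
        def __init__(self, features, batch_size, drop_last=False):
            self.features = features
            self.batch_size = batch_size
            self.drop_last = drop_last
            self.num_examples = len(features)

        def __iter__(self):
            for idx_start in range(0, self.num_examples, self.batch_size):
                # Optionally skip an incomplete final batch (see the drop_last discussion below).
                if self.drop_last and (idx_start + self.batch_size) > self.num_examples:
                    break
                yield torch.as_tensor(self.features[idx_start:idx_start + self.batch_size])

    # batch_size=None: the dataset already yields full batches, so the DataLoader
    # must not batch/collate again.
    val_dataloader = DataLoader(
        BatchedTabularDataset(np.random.rand(9769, 16).astype("float32"), batch_size=512),
        batch_size=None,
    )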

For instance, the time to load the 9769 rows in test.csv was reduced from 49 ms to 9 ms.

Data loading time was measured with the following code snippet.

        # Split wall-clock time into data-loading vs. prediction time: `subtotal`
        # accumulates the time spent in predict(); the remainder of `total` is the
        # time spent in the data loader. Assumes `time` is imported and
        # `val_dataloader` / `preds_dataset` are defined in the surrounding method.
        tic = time.time()
        subtotal = 0
        for batch_idx, data_batch in enumerate(val_dataloader):
            tik = time.time()
            preds_batch = self.model.predict(data_batch)
            preds_dataset.append(preds_batch)
            subtotal += time.time() - tik
        total = time.time() - tic
        print(f"elapsed (dataloader): {(total - subtotal) * 1000:.0f} ms")
        print(f"elapsed (predict): {subtotal * 1000:.0f} ms")

Under Linux, we are able to achieve 40% overall time savings for batch_size>10000.

[chart omitted: timing comparison for batch_size > 10000]

See the following chart for more benchmark results.

[chart omitted: additional benchmark results]

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@github-actions

Job PR-2395-8f85dcb is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2395/8f85dcb/index.html

@liangfu changed the title from "[Tabular] make tabular nn dataset iterable" to "[Post 0.6][Tabular] make tabular nn dataset iterable" on Nov 17, 2022
@liangfu (Collaborator, Author) commented Nov 21, 2022

cc @tonyhoo @Innixma

@Innixma (Contributor) left a comment

Looks very good! I am able to reproduce the speedups on multiple datasets, nice work!!

@github-actions

Job PR-2395-2341393 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-2395/2341393/index.html

Comment on lines +144 to +146
# Drop last batch
if self.drop_last and (idx_start + self.batch_size) > self.num_examples:
break
Contributor:

Were we dropping last before this PR?

@liangfu (Collaborator, Author):

This depends on the drop_last argument in the data loader.

Contributor:

I think I found the answer: we were, if you look at the deleted code lines.

@liangfu (Collaborator, Author):

We don't drop_last or shuffle when is_test==True; we were dropping the last batch and shuffling the dataset during training. That behavior is maintained for consistency.
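
Put differently, a minimal sketch (reusing the hypothetical BatchedTabularDataset names from the description above, not the actual AutoGluon code) of how the flags follow from the train/inference split:

    import numpy as np

    is_test = False                   # True at inference time
    features = np.random.rand(9769, 16).astype("float32")

    # Training: drop the last incomplete batch (and shuffle); inference keeps every row, in order.
    dataset = BatchedTabularDataset(
        features,
        batch_size=512,
        drop_last=not is_test,        # only drop a partial final batch during training
    )
    # Any shuffling would likewise be applied inside __iter__ only when not is_test.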
