Make iterable dataset more efficient #219

Merged: 7 commits, Jun 21, 2023


@epwalsh (Member) commented Jun 21, 2023

Makes our IterableDataset more memory efficient: before training starts, the rank 0 process saves the (shuffled) global indices to a NumPy memory-mapped file.
This file serves as the source of truth for the global data order, so it's no longer necessary to save the indices of training data to the .tsv.gz files as we were doing before.
This PR therefore also removes that functionality.
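
For readers unfamiliar with the pattern, here is a rough sketch of the idea, not the code in this PR. The function names, the file name, the `uint64` dtype, and the strided per-rank split are all assumptions made for illustration. In a real distributed job the other ranks would also need to wait (e.g. on a barrier) until rank 0 has finished writing.

```python
import os
import tempfile

import numpy as np


def save_global_indices(path, dataset_size, seed, rank):
    """Hypothetical helper: only rank 0 writes the shuffled global order."""
    if rank != 0:
        return
    rng = np.random.default_rng(seed)
    indices = np.arange(dataset_size, dtype=np.uint64)
    rng.shuffle(indices)
    # mode="w+" creates (or overwrites) the file backing the array.
    mmap = np.memmap(path, dtype=np.uint64, mode="w+", shape=(dataset_size,))
    mmap[:] = indices
    mmap.flush()
    del mmap  # release the writable mapping


def load_rank_indices(path, rank, world_size):
    """Hypothetical helper: each rank reads only its slice, lazily, from disk."""
    # mode="r" maps the file read-only; the shape is inferred from the file size.
    mmap = np.memmap(path, dtype=np.uint64, mode="r")
    # Assumed split: rank r takes every world_size-th index starting at r.
    return mmap[rank::world_size]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "global_indices.bin")  # illustrative file name
        save_global_indices(path, dataset_size=10, seed=0, rank=0)
        for rank in range(2):
            # Copy into a regular array so nothing references the file on cleanup.
            print(rank, np.array(load_rank_indices(path, rank, world_size=2)))
```

Because every rank maps the same file instead of holding its own copy of the full index array, the global order lives in one place and resident memory per process stays small.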

@epwalsh epwalsh merged commit 43c29d9 into main Jun 21, 2023
@epwalsh epwalsh deleted the iterable-dataset-memory-efficient branch June 21, 2023 19:22