Make iterable dataset more efficient #219

epwalsh · 2023-06-21T17:34:55Z

Makes our IterableDataset more memory efficient: before training starts, the rank 0 process saves the (shuffled) global indices to a numpy memory-mapped file.
This file serves as the source of truth for the global data order, so it's no longer necessary to save the indices of training data to the .tsv.gz files like we were doing before.
This PR therefore also removes that functionality.
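
For readers who want a picture of the idea, here is a minimal sketch of the approach, not the code from this PR. The function names (`build_global_indices`, `load_rank_indices`), the file name, the `uint64` dtype, and the strided per-rank split are all assumptions made for illustration; a real distributed run would also need a barrier so that other ranks wait until rank 0 has finished writing.

```python
import numpy as np


def build_global_indices(path, dataset_size, seed, rank=0):
    """Write a shuffled permutation of range(dataset_size) to `path`.

    Hypothetical helper: only rank 0 writes, mirroring the PR description.
    """
    if rank != 0:
        return
    rng = np.random.default_rng(seed)
    indices = np.arange(dataset_size, dtype=np.uint64)
    rng.shuffle(indices)  # in-place shuffle, reproducible for a given seed
    # mode="w+" creates (or overwrites) the file with the requested shape.
    mmap = np.memmap(path, dtype=np.uint64, mode="w+", shape=(dataset_size,))
    mmap[:] = indices
    mmap.flush()
    del mmap  # release the mapping
    # Assumption: a distributed barrier would go here in a real training job.


def load_rank_indices(path, rank, world_size):
    """Open the index file read-only and return this rank's strided view.

    The memmap is paged in lazily, so a rank does not need to hold the
    full global order in memory.
    """
    mmap = np.memmap(path, dtype=np.uint64, mode="r")
    return mmap[rank::world_size]
```

Usage would look something like `build_global_indices("global_indices.bin", 1000, seed=0)` on rank 0, followed by `load_rank_indices("global_indices.bin", rank, world_size)` on every rank. Because all ranks read the same file, the global data order can be recovered from that file alone, which is why separate per-run index logs become redundant.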

epwalsh added 7 commits June 21, 2023 10:32

Make iterable dataset more efficient

2c02044

update inspect data script

0929e6c

fix comment

1cc00e6

clean up config

28a5fcf

clean up

5ce8746

fix

1c94650

another fix

87b079f

epwalsh merged commit 43c29d9 into main Jun 21, 2023

epwalsh deleted the iterable-dataset-memory-efficient branch June 21, 2023 19:22
