
Load data from disk into memory in batches #52

Closed
yashjakhotiya opened this issue Oct 1, 2019 · 4 comments
Labels
enhancement New feature or request

Comments

@yashjakhotiya

Instead of loading all training and test data, can we load the data in memory in batches, i.e. on the fly during training and evaluation?

@mmjb mmjb added the enhancement New feature or request label Oct 1, 2019
@mmjb
Contributor

mmjb commented Oct 1, 2019

Thank you for the suggestion. This is not easy to implement for training (where shuffling between epochs requires having access to all the data, or at least some way of addressing smaller groups of samples at a time), but could be an easy enhancement for evaluation.

@yashjakhotiya
Author

One way is to have a single training example per file, shuffle the list of filenames before every epoch (as you already do), prefetch batches and preprocess them on the CPU while the model trains on the GPU, and perhaps use tf.data for the entire pipeline.
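A minimal sketch of what such a pipeline could look like, assuming one serialized example per file; the file pattern, feature spec, and batch size below are illustrative assumptions, not part of this codebase. With `shuffle=True`, `list_files` reshuffles the file order on each pass, which gives the per-epoch shuffling mentioned above:

```python
import tensorflow as tf

# Hypothetical layout: one serialized example per file under data/train/.
filenames = tf.data.Dataset.list_files("data/train/*.tfrecord", shuffle=True)

def parse_example(serialized):
    # Placeholder feature spec; the real one depends on the data format.
    features = {"tokens": tf.io.VarLenFeature(tf.int64)}
    return tf.io.parse_single_example(serialized, features)

dataset = (
    tf.data.TFRecordDataset(filenames)
    .map(parse_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .batch(32)
    # Prefetching overlaps CPU-side preprocessing with GPU training steps.
    .prefetch(tf.data.experimental.AUTOTUNE)
)
```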

@mmjb
Contributor

mmjb commented Oct 1, 2019

Yes, that's roughly how to do it. Two notes on that:

  1. One example per file has surprising side effects at this scale (e.g., many filesystems become very slow when millions of files sit in one directory, so you need subfolders; reading one small file at a time can lead to horrible disk access patterns; shuffling combined with the length of an epoch means filesystem caches are of little use). These issues can be mitigated by storing K samples per file and reading from several files at once, so that each minibatch of N samples is drawn from M << N files (see the sketch after this list).
  2. The minibatching routine is somewhat complicated, since it has to handle joint training on many languages, optional random switching between docstring and function name embeddings, etc. Pushing this into tf.data is possible, but painful.
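For point 1, a hedged sketch of the "K samples per file, M files at once" pattern using tf.data's `interleave`; the shard naming, `cycle_length`, and buffer sizes are assumptions for illustration:

```python
import tensorflow as tf

# Hypothetical layout: shards of K examples each, e.g. data/train/shard-*.tfrecord.
shards = tf.data.Dataset.list_files("data/train/shard-*.tfrecord", shuffle=True)

dataset = (
    shards
    # Read M shard files concurrently (cycle_length plays the role of M),
    # so each minibatch mixes examples from several files.
    .interleave(
        tf.data.TFRecordDataset,
        cycle_length=8,
        num_parallel_calls=tf.data.experimental.AUTOTUNE,
    )
    # A modest in-memory shuffle buffer restores per-example randomness
    # on top of the per-epoch shard shuffle.
    .shuffle(buffer_size=10_000)
    .batch(64)
    .prefetch(tf.data.experimental.AUTOTUNE)
)
```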

So, overall: it's a substantial amount of work that we are most likely not going to do. The released baselines are really just meant as a "here's a simple, straightforward approach, beat that" - we are happy for others to improve on this, either by rewriting things entirely or by improving our codebase.

@yashjakhotiya
Author

Got your point :)

This is an example, albeit on the data-pipeline side, of the general systems problems one faces when trying to beat SOTA language models these days.

Anyway, closing the issue now. Thank you for your consideration.
