
Load data from disk into memory in batches #52

Closed
yashjakhotiya opened this issue Oct 1, 2019 · 4 comments
Labels
enhancement New feature or request

Comments

@yashjakhotiya

Instead of loading all training and test data, can we load the data in memory in batches, i.e. on the fly during training and evaluation?

@mmjb mmjb added the enhancement New feature or request label Oct 1, 2019
@mmjb
Contributor

mmjb commented Oct 1, 2019

Thank you for the suggestion. This is not easy to implement for training (where shuffling between epochs requires having access to all the data, or at least some way of addressing smaller groups of samples at a time), but could be an easy enhancement for evaluation.

@yashjakhotiya
Author

One way is to have a single training example per file, shuffle the list of filenames before every epoch (as you already do), prefetch batches and preprocess them on the CPU while the model trains on the GPU, and perhaps use tf.data for the entire pipeline.
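A minimal sketch of what such a pipeline could look like, assuming one serialized example per file; the file pattern, feature spec, and batch size below are illustrative assumptions, not part of this codebase. With `shuffle=True`, `list_files` reshuffles the file order on each pass, which gives the per-epoch shuffling mentioned above:

```python
import tensorflow as tf

# Hypothetical layout: one serialized example per file under data/train/.
filenames = tf.data.Dataset.list_files("data/train/*.tfrecord", shuffle=True)

def parse_example(serialized):
    # Placeholder feature spec; the real one depends on the data format.
    features = {"tokens": tf.io.VarLenFeature(tf.int64)}
    return tf.io.parse_single_example(serialized, features)

dataset = (
    tf.data.TFRecordDataset(filenames)
    .map(parse_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .batch(32)
    # Prefetching overlaps CPU-side preprocessing with GPU training steps.
    .prefetch(tf.data.experimental.AUTOTUNE)
)
```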

@mmjb
Contributor

mmjb commented Oct 1, 2019

Yes, that's roughly how to do it. Two notes on that:

  1. One example per file has surprising side effects at this scale (e.g., many filesystems become very slow when millions of files sit in one directory, so you need subfolders; reading one small file at a time can lead to horrible disk access patterns; shuffling combined with the length of an epoch means filesystem caches are of little use). These issues can be mitigated by storing K samples per file and reading from several files at once, so that each minibatch of N samples is drawn from M << N files (see the sketch after this list).
  2. The minibatching routine is somewhat complicated, since it has to handle joint training on many languages, optional random switching between docstring and function name embeddings, etc. Pushing this into tf.data is possible, but painful.
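For point 1, a hedged sketch of the "K samples per file, M files at once" pattern using tf.data's `interleave`; the shard naming, `cycle_length`, and buffer sizes are assumptions for illustration:

```python
import tensorflow as tf

# Hypothetical layout: shards of K examples each, e.g. data/train/shard-*.tfrecord.
shards = tf.data.Dataset.list_files("data/train/shard-*.tfrecord", shuffle=True)

dataset = (
    shards
    # Read M shard files concurrently (cycle_length plays the role of M),
    # so each minibatch mixes examples from several files.
    .interleave(
        tf.data.TFRecordDataset,
        cycle_length=8,
        num_parallel_calls=tf.data.experimental.AUTOTUNE,
    )
    # A modest in-memory shuffle buffer restores per-example randomness
    # on top of the per-epoch shard shuffle.
    .shuffle(buffer_size=10_000)
    .batch(64)
    .prefetch(tf.data.experimental.AUTOTUNE)
)
```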

So, overall: it's a substantial amount of work that we are most likely not going to do. The released baselines are really just meant as a "here's a simple, straightforward approach, beat that" - we are happy for others to improve on this, either by rewriting things entirely or by improving our codebase.

@yashjakhotiya
Author

Got your point :)

This is an example, albeit on the data-pipeline side, of the general systems problems one faces when trying to beat SOTA language models these days.

Anyway, closing the issue now. Thank you for your consideration.
