This repository has been archived by the owner on Apr 11, 2023. It is now read-only.
Thank you for the suggestion. This is not easy to implement for training (where shuffling between epochs requires having access to all the data, or at least some way of addressing smaller groups of samples at a time), but could be an easy enhancement for evaluation.
One way is to have a single training example in a single file, shuffle the list of filenames before every epoch (as you already do), prefetch batches and preprocess them using a CPU while the model trains on GPUs, and maybe use tf.data in the entire process.
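The steps above can be sketched in plain Python (tf.data offers `shuffle`, `interleave`, and `prefetch` for the same pattern); the file layout and `load_fn` here are hypothetical, not the repo's actual format:

```python
import random

def epoch_batches(filenames, batch_size, load_fn, seed=None):
    """Yield minibatches for one epoch, reshuffling the file list first.

    load_fn(path) -> list of examples; here each file holds one example.
    """
    rng = random.Random(seed)
    files = list(filenames)
    rng.shuffle(files)  # new order every epoch
    batch = []
    for path in files:
        batch.extend(load_fn(path))
        while len(batch) >= batch_size:
            yield batch[:batch_size]
            batch = batch[batch_size:]
    if batch:  # final partial batch
        yield batch

# Usage: one epoch over toy "files" (a dict stands in for disk reads).
data = {f"ex{i}.json": [i] for i in range(10)}
batches = list(epoch_batches(data, 4, lambda p: data[p], seed=0))
```

In a real pipeline the `load_fn` calls would be overlapped with GPU compute (e.g. via `tf.data`'s `prefetch` or a background thread) rather than run serially as here.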
Yes, that's roughly how to do it. Two notes on that:
One example per file has surprising side effects at this scale (e.g., many filesystems become very slow once millions of files sit in one directory, so you need subfolders; reading individual tiny files leads to horrible disk access patterns; and the combination of shuffling and epoch length ensures that caches don't work well). These problems can be mitigated by storing K samples per file and then reading from several files at once, so that each minibatch of N samples is drawn from M << N files.
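The K-samples-per-file layout with M-way reading can be sketched like this (shard contents, K, and M are illustrative assumptions, not the repo's actual values):

```python
import random
from itertools import islice

def interleaved_stream(shards, m, read_fn, seed=None):
    """Draw examples by cycling over m concurrently-open shards.

    shards: list of shard identifiers, each holding K examples.
    read_fn(shard) -> iterator over that shard's examples.
    Consecutive examples come from different shards, so a minibatch of
    N samples is drawn from up to m << N files.
    """
    rng = random.Random(seed)
    order = list(shards)
    rng.shuffle(order)
    pending = iter(order)
    open_iters = []
    while True:
        # top up to m open shards
        while len(open_iters) < m:
            nxt = next(pending, None)
            if nxt is None:
                break
            open_iters.append(read_fn(nxt))
        if not open_iters:
            return
        it = rng.choice(open_iters)
        try:
            yield next(it)
        except StopIteration:
            open_iters.remove(it)  # shard exhausted; replace next loop

# Usage: 4 shards of K=5 examples each, reading m=2 shards at a time.
shards = {s: [f"{s}:{i}" for i in range(5)] for s in "abcd"}
stream = interleaved_stream(list(shards), 2, lambda s: iter(shards[s]), seed=1)
first_batch = list(islice(stream, 8))
```

`tf.data.Dataset.interleave` with `cycle_length=m` implements the same idea with prefetching and parallel reads built in.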
The minibatching routine is a bit complicated to handle joint training on many languages, optional random switching between docstring and function name embeddings, etc. Pushing this into tf.data is possible, but painful.
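To illustrate what that routine has to juggle, here is a hypothetical sketch in plain Python; the language names, field names, and the uniform docstring/function-name switch are illustrative assumptions, not the repo's actual code:

```python
import random

def mixed_minibatch(per_language, batch_size, rng):
    """Build one minibatch mixing languages and query representations.

    per_language: dict mapping language -> list of samples, where each
    sample is a dict with 'docstring' and 'func_name' fields.
    """
    langs = list(per_language)
    batch = []
    for _ in range(batch_size):
        lang = rng.choice(langs)                      # joint training: pick a language
        sample = rng.choice(per_language[lang])
        key = rng.choice(["docstring", "func_name"])  # random query switching
        batch.append((lang, key, sample[key]))
    return batch

# Usage with toy data for two languages.
rng = random.Random(0)
data = {
    "python": [{"docstring": "adds two ints", "func_name": "add"}],
    "go":     [{"docstring": "reads a file",  "func_name": "ReadFile"}],
}
mb = mixed_minibatch(data, 6, rng)
```

Expressing this branching logic inside a `tf.data` graph (rather than in ordinary Python like this) is what makes the port painful.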
So, overall: It's a substantial amount of work that we are most likely not going to do. The released baselines are really just meant as a "here's a simple straightforward approach, beat that" - we are happy for others to improve on this, either by entirely rewriting things or improving our codebase.
This is an example, on the data pipeline side, of the general systems problems one runs into when trying to beat SOTA language models these days.
Anyway, closing the issue now. Thank you for your consideration.
Instead of loading all training and test data into memory at once, can we load the data in batches on the fly during training and evaluation?