The paper *Don't Decay the Learning Rate, Increase the Batch Size* makes the case for increasing the batch size over time instead of annealing the learning rate.
This PR adds the possibility to use arbitrarily large mini-batch sizes via a gradient accumulation strategy (closes #1072). It introduces the parameter `mini_batch_chunk_size`, which you can set to break large mini-batches down into smaller chunks for processing purposes.

So let's say you want a mini-batch size of 128, but your memory cannot handle more than 32 samples at a time. Then you can train like this:
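A minimal sketch of such a call, assuming a Flair `ModelTrainer` that is already set up with a model and corpus (the output path and other hyperparameters are placeholders):

```python
from flair.trainers import ModelTrainer

# assume `tagger` and `corpus` are set up as usual for your task
trainer = ModelTrainer(tagger, corpus)

# effective mini-batch size is 128, but each forward/backward pass only
# sees 32 samples; gradients are accumulated over the 4 chunks before
# the optimizer step
trainer.train(
    'resources/taggers/example',
    learning_rate=0.1,
    mini_batch_size=128,
    mini_batch_chunk_size=32,
    max_epochs=150,
)
```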
Because we can now raise the mini-batch size arbitrarily, we can also execute the annealing strategy from the paper above. Do it like this:
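A sketch under the assumption that the option is exposed as a `batch_growth_annealing` flag on `ModelTrainer.train` (start with a small mini-batch and let it grow instead of only decaying the learning rate):

```python
# start small; the mini-batch size grows whenever the learning rate anneals
trainer.train(
    'resources/taggers/example',
    learning_rate=0.1,
    mini_batch_size=32,
    mini_batch_chunk_size=32,      # keep per-pass memory bounded
    batch_growth_annealing=True,   # double mini_batch_size on each anneal
    anneal_factor=0.5,
    patience=3,
    max_epochs=150,
)
```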
This will double the mini-batch size each time the learning rate anneals. You can also combine this with "annealing with restarts" in which the last best model state is restored each time the learning rate anneals.
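For the combination with restarts, assuming it is exposed as the existing `anneal_with_restarts` flag, the call would simply add that option:

```python
# each time the learning rate anneals, the best model state so far is
# restored and the mini-batch size is doubled (flag names are assumptions)
trainer.train(
    'resources/taggers/example',
    learning_rate=0.1,
    mini_batch_size=32,
    mini_batch_chunk_size=32,
    batch_growth_annealing=True,
    anneal_with_restarts=True,
)
```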