
Pack batches more tightly with maximum_samples_per_batch. #2111

Merged: 6 commits into allenai:master on Nov 29, 2018

Conversation

@brendan-ai2 (Member) commented Nov 29, 2018

  • Previously, when maximum_samples_per_batch was used, overly large batches were split into equal-sized pieces.
  • This could lead to fragmentation, with each resulting sub-batch well under the limit.
  • This change instead packs batches until just before maximum_samples_per_batch (or batch_size) would be exceeded (see the sketch below).
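For intuition, here is a minimal sketch of the greedy packing idea. It is illustrative only, not the actual AllenNLP implementation: the real iterator measures samples in terms of padded tokens and carries leftovers between calls via excess, and pack_greedily / sample_count are made-up names.

from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def pack_greedily(instances: Iterable[T],
                  batch_size: int,
                  maximum_samples_per_batch: int,
                  sample_count: Callable[[T], int]) -> List[List[T]]:
    """Accumulate instances into the current batch until adding one more
    would exceed either limit, then start a new batch.  The old behaviour
    split an oversized batch into equal-sized pieces instead."""
    batches: List[List[T]] = []
    current: List[T] = []
    current_samples = 0
    for instance in instances:
        cost = sample_count(instance)
        if current and (len(current) == batch_size
                        or current_samples + cost > maximum_samples_per_batch):
            batches.append(current)
            current, current_samples = [], 0
        current.append(instance)
        current_samples += cost
    if current:
        batches.append(current)
    return batches


# With a 100-sample cap, instances of cost [60, 30, 20, 90] pack as
# [[60, 30], [20], [90]]:
print(pack_greedily([60, 30, 20, 90], batch_size=4,
                    maximum_samples_per_batch=100, sample_count=lambda x: x))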
@brendan-ai2 (Member, Author) commented Nov 29, 2018

@swabhs, pull this down and give it a try. Here's some sample iterator config that you can then adjust to your liking:

{
  "type": "bucket",
  # Should be decently larger than your batch size.
  "max_instances_in_memory": 4096 * NUM_GPUS,
  # Should be large enough that you'll always (or almost always) split based on maximum_samples_per_batch set below.
  "batch_size": 512 * NUM_GPUS,
  # Adjust as appropriate.
  "sorting_keys": [["source", "num_tokens"]],
  # Adjust as appropriate.
  "biggest_batch_first": true,
  # Adjust as appropriate.
  "maximum_samples_per_batch": ["num_tokens", NUM_GPUS * 2000]
}
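For reference, the same settings expressed directly in Python would presumably look roughly like this (a sketch assuming AllenNLP 0.x's BucketIterator, with NUM_GPUS set to an illustrative value):

from allennlp.data.iterators import BucketIterator

NUM_GPUS = 4  # illustrative value

iterator = BucketIterator(
    sorting_keys=[("source", "num_tokens")],
    batch_size=512 * NUM_GPUS,
    max_instances_in_memory=4096 * NUM_GPUS,
    biggest_batch_first=True,
    maximum_samples_per_batch=("num_tokens", NUM_GPUS * 2000),
)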
brendan-ai2 added 2 commits Nov 29, 2018

brendan-ai2 requested a review from DeNeutoy on Nov 29, 2018

@brendan-ai2 (Member, Author) commented Nov 29, 2018

@DeNeutoy, here is the iterator improvement we discussed the other week. I ended up implementing it as a change to the existing iterators as it seems strictly better than the old method. But perhaps I'm missing something in terms of backwards compatibility. Let me know! Thanks!

@DeNeutoy (Contributor) left a comment

Awesome, looks much better than what was there before 👍

def _ensure_batch_is_sufficiently_small(
        self,
        batch_instances: Iterable[Instance],
        excess: deque) -> List[List[Instance]]:

@DeNeutoy (Contributor) Nov 29, 2018

For everything but core types, it's better to use the collection's corresponding entry in the typing module, because it allows you to specify what the collection contains, e.g.:

from typing import Deque

...
excess: Deque[Instance]) -> ....

It's also less confusing, because they are strictly abstract, so it's obvious they're being used for type hints or inheritance.

@brendan-ai2 (Member, Author) Nov 29, 2018

Good call, thanks! I didn't know that all collection classes had corresponding types.
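Putting the two together, the annotated signature would presumably end up roughly like this (an illustrative sketch, not the merged code; the class here is just a stand-in):

from typing import Deque, Iterable, List

from allennlp.data.instance import Instance


class DataIterator:  # stand-in for the real AllenNLP iterator class
    def _ensure_batch_is_sufficiently_small(
            self,
            batch_instances: Iterable[Instance],
            excess: Deque[Instance]) -> List[List[Instance]]:
        ...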

If self._maximum_samples_per_batch is specified, then split the batch
into smaller sub-batches if it exceeds the maximum size.
Any excess passed in will be used first. When the method returns excess

@DeNeutoy (Contributor) Nov 29, 2018

Could you mention explicitly in this docstring that excess is mutated in-place?

@brendan-ai2 (Member, Author) Nov 29, 2018

Cleaned up the language and left a big warning.
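The merged wording isn't reproduced in this thread; a hypothetical illustration of the kind of warning being requested might read:

def _ensure_batch_is_sufficiently_small(self, batch_instances, excess):
    """
    If self._maximum_samples_per_batch is specified, split the batch into
    smaller sub-batches whenever that limit would be exceeded.  Instances
    already in ``excess`` are packed first.

    WARNING: ``excess`` is mutated in place.  Instances that do not fit
    into the returned sub-batches are left in it for the next call.
    """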

@DeNeutoy (Contributor) left a comment

Could you also change the docstring of this class to reflect the new behaviour? ✍️

brendan-ai2 added 2 commits Nov 29, 2018
@brendan-ai2 (Member, Author) commented Nov 29, 2018

Class docstring changed too. Thanks for the speedy review!

brendan-ai2 merged commit 240974f into allenai:master on Nov 29, 2018

3 checks passed:
  • Pull Requests (AllenNLP Library): TeamCity build finished
  • codecov/patch: 100% of diff hit (target 90%)
  • codecov/project: 92% (+<1%) compared to 42d076e
@swabhs commented Nov 30, 2018

(Quoting @brendan-ai2's sample iterator config from above.)

This is running now, and is allowing bigger batches - hope it is the solution I have been looking for!
