
OOM on GPU with Tensorboard enabled #4797 (problem2) #4834

Conversation

Vladimir-Yashin

The current implementation of the TensorBoard callback passes the whole
validation_data through sess.run() at once, which causes OOM on the GPU
for larger datasets or leads to a much higher memory footprint.

If the validation data were split into batches, it would be necessary to:

  • split data into batches
  • pass each batch through sess.run and save the result as summary_str
    (serialized Summary object)
  • somehow take these objects apart and average all the histograms manually
  • prepare an aggregated Summary object and write it to the TensorBoard
    log file

Instead of doing that, my approach is simpler (see the sketch below):

  • sample batch_size worth of data points from validation_data
  • run sess.run() once

This may lead to a few problems:

  • histograms won't be 100% accurate, since not all data is taken into
    account
  • histograms will vary slightly even if the weights didn't change between
    epochs, simply because each time the TensorBoard callback is engaged it
    picks a different set of samples to process
  • the smaller the batch_size, the more pronounced these effects are
  • when validation_data is smaller than batch_size, some samples will be
    used multiple times and others may not be used at all

However, the benefit is worth it: the TensorBoard callback won't lead to a
huge memory footprint and won't cause an OOM crash when the whole
validation_data doesn't fit into GPU memory.
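
A minimal sketch of the sampling step, assuming validation_data arrives as a
list of NumPy arrays (inputs, targets, sample weights); the helper name and
the commented usage are illustrative rather than the exact patch:

```python
import numpy as np

def sample_validation_batch(val_data, batch_size):
    """Pick a random batch_size-sized subset of every array in val_data.

    Sampling is done with replacement, so this also covers the case where
    the validation set is smaller than batch_size (some samples are then
    simply repeated, as noted in the list above).
    """
    n = val_data[0].shape[0]
    idx = np.random.randint(0, n, size=batch_size)
    return [arr[idx] for arr in val_data]

# Hypothetical use inside the callback: feed only the sampled batch
# through the merged summary op instead of the whole validation set.
# `tensors`, `merged`, `sess` and `writer` stand for the callback's input
# placeholders, merged summary op, TF session and summary writer.
#
#   sampled = sample_validation_batch(val_data, batch_size=32)
#   feed_dict = dict(zip(tensors, sampled))
#   summary_str = sess.run(merged, feed_dict=feed_dict)
#   writer.add_summary(summary_str, epoch)
```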

pytest for Linux x86_64, Python 3.5.2, TF 0.12.0
pytest_log.txt

@fchollet
Member

fchollet commented Jan 4, 2017

Batches may be very small, so only sampling batch_size worth of data points is likely to lead to very inaccurate histograms. It would be safer to go with the first solution you mentioned: iterated calls to sess.run.
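
For comparison, a rough sketch of that batched alternative, assuming the same
`sess`, `merged` summary op, input `tensors` and summary `writer` that the
callback holds; only scalar summaries are averaged here, since merging
histogram protos across batches is exactly the manual aggregation step
discussed above:

```python
import tensorflow as tf

def write_batched_summary(sess, merged, tensors, val_data, batch_size,
                          writer, epoch):
    """Run the merged summary op batch by batch and write a single
    averaged Summary per epoch (scalar values only; histogram protos
    would still need the manual merging described in the PR)."""
    n = val_data[0].shape[0]
    scalar_sums = {}  # tag -> running sum of simple_value
    num_batches = 0
    for start in range(0, n, batch_size):
        batch = [arr[start:start + batch_size] for arr in val_data]
        summary_str = sess.run(merged, feed_dict=dict(zip(tensors, batch)))
        summary = tf.Summary()
        summary.ParseFromString(summary_str)
        for value in summary.value:
            if value.HasField('simple_value'):
                scalar_sums[value.tag] = (scalar_sums.get(value.tag, 0.0)
                                          + value.simple_value)
        num_batches += 1
    aggregated = tf.Summary()
    for tag, total in scalar_sums.items():
        aggregated.value.add(tag=tag, simple_value=total / num_batches)
    writer.add_summary(aggregated, epoch)
```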

@Vladimir-Yashin
Author

@fchollet If we have, say, 100 batches in validation_data this would mean plotting 100 histograms, one per batch, on the same plot every epoch (assuming the TensorBoard callback is configured to capture histograms every epoch).

Collecting 100 Summary objects and aggregating them manually doesn't seem to be a good idea either.

What if a new configuration knob were introduced that lets the user specify how much data to pass through the network to plot histograms? It could be something like 9999999 by default, so all data would be passed at once as we do today, but it would allow fine-tuning for cases where GPU memory is scarce.
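
A sketch of what such a knob might look like, as a hypothetical subclass of the existing callback; the parameter name `histogram_max_samples` is made up for illustration, and the integration point described in the trailing comment assumes the Keras 1.x callback structure of the time:

```python
import numpy as np
from keras.callbacks import TensorBoard

class SampledTensorBoard(TensorBoard):
    """Hypothetical TensorBoard variant with a cap on how many validation
    samples are fed through the summary ops for histograms."""

    def __init__(self, histogram_max_samples=9999999, **kwargs):
        super(SampledTensorBoard, self).__init__(**kwargs)
        # The large default keeps today's behaviour (all validation data
        # at once); a smaller value trades histogram accuracy for memory.
        self.histogram_max_samples = histogram_max_samples

    def _subsample(self, val_data):
        """Return val_data capped at histogram_max_samples rows."""
        n = val_data[0].shape[0]
        if n <= self.histogram_max_samples:
            return val_data
        idx = np.random.choice(n, self.histogram_max_samples, replace=False)
        return [arr[idx] for arr in val_data]

    # on_epoch_end would then build its feed_dict from
    # self._subsample(...) instead of the full validation data, leaving
    # the rest of the parent implementation unchanged.
```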

@fchollet
Member

Closing outdated PR. If you still care about the content of the PR, please submit a new PR to master, updated for the Keras 2.0 API.

@fchollet fchollet closed this Mar 15, 2017