
OOM on GPU with Tensorboard enabled #4797 (problem2) #4834

Conversation

Vladimir-Yashin

The current implementation of the TensorBoard callback passes the whole
validation_data through sess.run() at once, which causes OOM on the GPU
for larger datasets or leads to a much higher memory footprint.

If the validation data were split into batches, it would be necessary to:

  • split data into batches
  • pass each batch through sess.run and save the result as summary_str
    (serialized Summary object)
  • somehow take these objects apart and average all the histograms manually
  • prepare an aggregated Summary object and write it to the TensorBoard
    log file

Instead of doing that, my approach is simpler (see the sketch below):

  • sample batch_size worth of data points from validation_data
  • run sess.run() once

This may lead to a few problems:

  • histograms won't be 100% accurate, since not all data is taken into
    account
  • histograms will vary slightly even if the weights didn't change between
    epochs, simply because each time the TensorBoard callback is engaged it
    picks a different set of samples to process
  • the smaller the batch_size, the more pronounced these effects are
  • when validation_data is smaller than batch_size, some samples will be
    used multiple times and others may not be used at all

However, the benefit is worth it: the TensorBoard callback won't lead to a
huge memory footprint and won't cause an OOM crash when the whole
validation_data doesn't fit into GPU memory.
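
A minimal sketch of the sampling step, assuming validation_data arrives as a
list of NumPy arrays (inputs, targets, sample weights); the helper name and
the commented usage are illustrative rather than the exact patch:

```python
import numpy as np

def sample_validation_batch(val_data, batch_size):
    """Pick a random batch_size-sized subset of every array in val_data.

    Sampling is done with replacement, so this also covers the case where
    the validation set is smaller than batch_size (some samples are then
    simply repeated, as noted in the list above).
    """
    n = val_data[0].shape[0]
    idx = np.random.randint(0, n, size=batch_size)
    return [arr[idx] for arr in val_data]

# Hypothetical use inside the callback: feed only the sampled batch
# through the merged summary op instead of the whole validation set.
# `tensors`, `merged`, `sess` and `writer` stand for the callback's input
# placeholders, merged summary op, TF session and summary writer.
#
#   sampled = sample_validation_batch(val_data, batch_size=32)
#   feed_dict = dict(zip(tensors, sampled))
#   summary_str = sess.run(merged, feed_dict=feed_dict)
#   writer.add_summary(summary_str, epoch)
```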

pytest for Linux x86_64, Python 3.5.2, TF 0.12.0
pytest_log.txt

@fchollet
Member

fchollet commented Jan 4, 2017

Batches may be very small, so only sampling batch_size worth of data points is likely to lead to very inaccurate histograms. It would be safer to go with the first solution you mentioned: iterated calls to sess.run.
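
For comparison, a rough sketch of that batched alternative, assuming the same
`sess`, `merged` summary op, input `tensors` and summary `writer` that the
callback holds; only scalar summaries are averaged here, since merging
histogram protos across batches is exactly the manual aggregation step
discussed above:

```python
import tensorflow as tf

def write_batched_summary(sess, merged, tensors, val_data, batch_size,
                          writer, epoch):
    """Run the merged summary op batch by batch and write a single
    averaged Summary per epoch (scalar values only; histogram protos
    would still need the manual merging described in the PR)."""
    n = val_data[0].shape[0]
    scalar_sums = {}  # tag -> running sum of simple_value
    num_batches = 0
    for start in range(0, n, batch_size):
        batch = [arr[start:start + batch_size] for arr in val_data]
        summary_str = sess.run(merged, feed_dict=dict(zip(tensors, batch)))
        summary = tf.Summary()
        summary.ParseFromString(summary_str)
        for value in summary.value:
            if value.HasField('simple_value'):
                scalar_sums[value.tag] = (scalar_sums.get(value.tag, 0.0)
                                          + value.simple_value)
        num_batches += 1
    aggregated = tf.Summary()
    for tag, total in scalar_sums.items():
        aggregated.value.add(tag=tag, simple_value=total / num_batches)
    writer.add_summary(aggregated, epoch)
```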

@Vladimir-Yashin
Author

@fchollet If we have, say, 100 batches in validation_data this would mean plotting 100 histograms, one per batch, on the same plot every epoch (assuming the TensorBoard callback is configured to capture histograms every epoch).

Collecting 100 Summary objects and aggregating them manually doesn't seem to be a good idea either.

What if a new configuration knob were introduced that lets the user specify how much data to pass through the network to plot histograms? It could be something like 9999999 by default, so all data would be passed at once as we do today, but it would allow fine-tuning for cases where GPU memory is scarce.
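
A sketch of what such a knob might look like, as a hypothetical subclass of the existing callback; the parameter name `histogram_max_samples` is made up for illustration, and the integration point described in the trailing comment assumes the Keras 1.x callback structure of the time:

```python
import numpy as np
from keras.callbacks import TensorBoard

class SampledTensorBoard(TensorBoard):
    """Hypothetical TensorBoard variant with a cap on how many validation
    samples are fed through the summary ops for histograms."""

    def __init__(self, histogram_max_samples=9999999, **kwargs):
        super(SampledTensorBoard, self).__init__(**kwargs)
        # The large default keeps today's behaviour (all validation data
        # at once); a smaller value trades histogram accuracy for memory.
        self.histogram_max_samples = histogram_max_samples

    def _subsample(self, val_data):
        """Return val_data capped at histogram_max_samples rows."""
        n = val_data[0].shape[0]
        if n <= self.histogram_max_samples:
            return val_data
        idx = np.random.choice(n, self.histogram_max_samples, replace=False)
        return [arr[idx] for arr in val_data]

    # on_epoch_end would then build its feed_dict from
    # self._subsample(...) instead of the full validation data, leaving
    # the rest of the parent implementation unchanged.
```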

@fchollet
Member

Closing outdated PR. If you still care about the content of the PR, please submit a new PR to master, updated for the Keras 2.0 API.

@fchollet fchollet closed this Mar 15, 2017