How to remove stale models from GPU memory #5345

Closed
datumbox opened this issue Feb 9, 2017 · 15 comments

Comments

@datumbox
Contributor

datumbox commented Feb 9, 2017

Update (2018/08/01): I would like to provide an update, as I was new to Keras when I posted this question. Currently only the TensorFlow backend supports proper cleanup of the session, via K.clear_session(). This removes EVERYTHING from memory (models, optimizer objects and anything that holds tensors internally), so there is no way to remove one specific stale model. This is not a bug in Keras but a limitation of the backends.
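
To illustrate the limitation, here is a minimal sketch (build_model_a and build_model_b are hypothetical helpers): clearing the session invalidates every model built on it, not just the one you want to drop.

    from keras import backend as K

    model_a = build_model_a()  # hypothetical helper
    model_b = build_model_b()  # hypothetical helper

    K.clear_session()  # frees the single backend graph behind BOTH models;
                       # model_a and model_b are now unusable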


I am working on a pipeline that takes a pre-trained model, splits it, caches the intermediate results of the bottom layers, fine-tunes the top, and merges bottom and top back together. I do two passes of the above using different splits and optimizers. This speeds up training by about 3x compared to simply freezing the bottom layers.

As you can see, the above process initializes many models which are later discarded. Unfortunately, their weights seem to remain in GPU memory, and after a couple of steps I get an out-of-memory exception: "ResourceExhaustedError (see above for traceback): OOM when allocating tensor".

Is there a way to remove stale models from GPU memory? I tried del and calling Python's gc.collect(), but neither worked. Closing/clearing the session is not possible, as this is part of a single pipeline. My backend is TensorFlow.

Here is a simplified pseudo-code of the process:

import gc

model = load_pretrained_model()
bottom, top = split_model(model)  # bottom and top get a fresh copy of the weights
del model
gc.collect()

intermediate_results = bottom.predict(data)  # cache the bottom's activations
top.fit(intermediate_results)
del intermediate_results, data

model = merge_model(top, bottom)  # exception happens here
del top, bottom
gc.collect()
@Markus92

Hi, I have a similar issue even when just retraining the same model (alternating between model.fit and model.fit_generator). Since I keep all the weights and the batch sizes are equal, there shouldn't be any reason for it to consume more memory.

@datumbox
Contributor Author

datumbox commented Feb 10, 2017

The only hacky/terrible solution that seems to work involves checkpointing the models you want to keep, cleaning up all the memory, and reloading the models from disk. Does anyone know a better way? Perhaps you can drop specific graphs or variables?

from keras import backend as K
from keras.models import load_model

m = Model(...)
m.save(tmp_model_name)          # checkpoint the model you want to keep
del m
K.clear_session()               # wipe the entire backend session
m = load_model(tmp_model_name)  # reload into a fresh graph

@stale stale bot added the stale label May 23, 2017
@stale stale bot closed this as completed Jun 22, 2017
@astrojuanlu

The memory is not released immediately after calling K.clear_session(). I guess one needs to run load_model afterwards?

@ruiyuanlu

Hit the same problem. K.clear_session() doesn't work.

@drsxr

drsxr commented Apr 23, 2018

It would be nice to have some guidance on this issue from folks who have dealt with it more elegantly than the save-model/delete-model/clear-session/load-model hack. In my view it is pretty important for reproducibility in Keras.

@tRosenflanz

Does this work?

    import gc
    from keras import backend as K

    K.clear_session()
    gc.collect()

@rahulkulhalli

Same problem here. I'm using an EC2 instance with 100 GB RAM and a Tesla M60 GPU. I wrote a simple iterative loop to perform a grid search over my hyperparameters and validate each candidate on a small subset of my training data. However, I can't do this due to the constant OOM errors, and quite frankly, manual sequential tuning is getting on my nerves. Is there any concrete way to clear the GPU memory utilized by Keras in code? I don't want to keep restarting my kernel every time.

Just FYI, I run watch -d nvidia-smi to keep track of the GPU memory.

I load a model into memory for the first time and Keras utilizes all of the GPU's 8 GB of memory. Even after calling K.clear_session() or del K or gc.collect(), the stale model is not cleared from memory.

Does anyone have a concrete solution/workaround to this?
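
For reference, here is a minimal sketch of that kind of loop with the cleanup calls suggested above (build_model, param_grid, and the data variables are hypothetical); each iteration tears down the whole backend session before building the next candidate:

    import gc
    from keras import backend as K

    results = {}
    for params in param_grid:            # hypothetical list of hyperparameter dicts
        model = build_model(params)      # hypothetical model-building helper
        model.fit(x_train, y_train, epochs=5, verbose=0)
        results[str(params)] = model.evaluate(x_val, y_val, verbose=0)

        # Tear everything down before the next candidate.
        del model
        K.clear_session()
        gc.collect()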

@ms1design

ms1design commented Jul 30, 2018

Check the bottom of #2102 and #9379.

@phobrain

phobrain commented Oct 4, 2018

Try clear_session() before del model, the hypothesis being that clear_session() still needs the model object.

@flyingalexis

The memory is not released immediately after calling K.clear_session(). I guess one needs to run load_model afterwards?

You saved my day!
But by the way, how do you know that?

@danFromTelAviv
Contributor

Calling

    K.clear_session()
    del model

after each training cycle worked for me.

@xl233

xl233 commented Apr 10, 2019

I'm not sure why, but it works for me when I add all three of these lines:

import gc
from keras import backend as K

K.clear_session()
gc.collect()
del model

@zachmayer
Contributor

Is there a way to tell which tensorflow variables are associated with a specific model?

I'd like to only clear the variables associated with a specific model, and then delete it.

@maks-ym

maks-ym commented Sep 1, 2019

@zachmayer

Is there a way to tell which tensorflow variables are associated with a specific model?

I'd like to only clear the variables associated with a specific model, and then delete it.

Just a guess: what about creating a separate session for each model and then using the methods mentioned above to clear that specific session? Presumably, that would clean up only the variables you want.
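
A minimal sketch of that idea using the TF1-style graph/session API (build_model and the data variables are hypothetical): each model lives in its own graph and session, so closing the session should release only that model's variables.

    import tensorflow as tf
    from keras import backend as K

    graph = tf.Graph()
    with graph.as_default():
        session = tf.Session(graph=graph)
        K.set_session(session)   # make Keras build this model in its own session
        model = build_model()    # hypothetical model-building helper
        model.fit(x_train, y_train)

    session.close()              # releases this model's resources only
    del model, session, graph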

@fPkX6F1nGTX

This strategy is currently working for me on a Lambda machine; I am using only one GPU at a time:

https://stackoverflow.com/a/61252435/12763497

It is a community wiki answer, so please feel free to edit it if you find anything else out. I do have memory leaks, but they are eliminated by using calls to multiprocessing.Process with a timeout feature (which does require estimating the maximum duration of each model training/validation session).
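
A rough sketch of that multiprocessing pattern (run_training, configs, and MAX_SECONDS are hypothetical): running each training in a child process means the GPU memory is returned to the OS when the process exits, and the timeout guards against runs that hang or leak.

    import multiprocessing as mp

    def train_and_evaluate(config, queue):
        # Import Keras inside the child so TensorFlow claims the GPU in this
        # process only; all GPU memory is freed when the process exits.
        import keras  # noqa: F401
        score = run_training(config)  # hypothetical training/validation helper
        queue.put(score)

    if __name__ == "__main__":
        for config in configs:  # hypothetical iterable of hyperparameter configs
            queue = mp.Queue()
            p = mp.Process(target=train_and_evaluate, args=(config, queue))
            p.start()
            p.join(timeout=MAX_SECONDS)  # hypothetical per-run time budget
            if p.is_alive():
                p.terminate()            # kill a run that overshot its budget
            elif not queue.empty():
                print(config, queue.get())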
