How to remove stale models from GPU memory #5345

Closed
datumbox opened this issue Feb 9, 2017 · 15 comments

Comments

@datumbox
Contributor

datumbox commented Feb 9, 2017

Update (2018/08/01): I would like to provide an update, as I was new to Keras when I posted this question. Currently only the TensorFlow backend supports proper cleanup of the session, via K.clear_session(). This removes EVERYTHING from memory (models, optimizer objects and anything that holds tensors internally), so there is no way to remove one specific stale model. This is not a bug in Keras but a limitation of the backends.
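
To illustrate the limitation, here is a minimal sketch (build_model_a and build_model_b are hypothetical helpers): clearing the session invalidates every model built on it, not just the one you want to drop.

    from keras import backend as K

    model_a = build_model_a()  # hypothetical helper
    model_b = build_model_b()  # hypothetical helper

    K.clear_session()  # frees the single backend graph behind BOTH models;
                       # model_a and model_b are now unusable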


I am working on a pipeline that takes a pre-trained model, splits it, caches the intermediate results of the bottom layers, fine-tunes the top, and merges bottom and top back together. I do two passes of the above using different splits and optimizers. This speeds up training by about 3x compared to simply freezing the bottom layers.

As you can see, the above process initializes many models which are later discarded. Unfortunately, their weights seem to remain in GPU memory, and after a couple of steps I get an out-of-memory exception: "ResourceExhaustedError (see above for traceback): OOM when allocating tensor".

Is there a way to remove stale models from GPU memory? I tried del and calling Python's gc.collect(), but neither worked. Closing/clearing the session is not possible, as this is part of a single pipeline. My backend is TensorFlow.

Here is a simplified pseudo-code of the process:

import gc

model = load_pretrained_model()
bottom, top = split_model(model)  # bottom and top get a fresh copy of the weights
del model
gc.collect()

intermediate_results = bottom.predict(data)  # cache the bottom's activations
top.fit(intermediate_results)
del intermediate_results, data

model = merge_model(top, bottom)  # exception happens here
del top, bottom
gc.collect()
@Markus92

Hi, I have a similar issue even when just retraining the same model (alternating between model.fit and model.fit_generator). Since I keep all the weights and the batch sizes are equal, there shouldn't be any reason for it to consume more memory.

@datumbox
Contributor Author

datumbox commented Feb 10, 2017

The only hacky/terrible solution that seems to work involves checkpointing the models you want to keep, cleaning up all the memory, and reloading the models from disk. Does anyone know a better way? Perhaps you can drop specific graphs or variables?

from keras import backend as K
from keras.models import load_model

m = Model(...)
m.save(tmp_model_name)          # checkpoint the model you want to keep
del m
K.clear_session()               # wipe the entire backend session
m = load_model(tmp_model_name)  # reload into a fresh graph

@stale stale bot added the stale label May 23, 2017
@stale stale bot closed this as completed Jun 22, 2017
@astrojuanlu

The memory is not released immediately after calling K.clear_session(). I guess one needs to run load_model afterwards?

@ruiyuanlu

Hit the same problem. K.clear_session() doesn't work.

@drsxr

drsxr commented Apr 23, 2018

It would be nice to have some guidance on this issue from folks who have dealt with it more elegantly than the save-model/delete-model/clear-session/load-model hack. In my view it is pretty important for reproducibility in Keras.

@tRosenflanz

Does this work?

    import gc
    from keras import backend as K

    K.clear_session()
    gc.collect()

@rahulkulhalli

Same problem here. I'm using an EC2 instance with 100 GB RAM and a Tesla M60 GPU. I wrote a simple iterative loop to perform a grid search over my hyperparameters and validate each candidate on a small subset of my training data. However, I can't do this due to the constant OOM errors, and quite frankly, manual sequential tuning is getting on my nerves. Is there any concrete way to clear the GPU memory utilized by Keras in code? I don't want to keep restarting my kernel every time.

Just FYI, I run watch -d nvidia-smi to keep track of the GPU memory.

I load a model into memory for the first time and Keras utilizes all of the GPU's 8 GB of memory. Even after calling K.clear_session() or del K or gc.collect(), the stale model is not cleared from memory.

Does anyone have a concrete solution/workaround to this?
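
For reference, here is a minimal sketch of that kind of loop with the cleanup calls suggested above (build_model, param_grid, and the data variables are hypothetical); each iteration tears down the whole backend session before building the next candidate:

    import gc
    from keras import backend as K

    results = {}
    for params in param_grid:            # hypothetical list of hyperparameter dicts
        model = build_model(params)      # hypothetical model-building helper
        model.fit(x_train, y_train, epochs=5, verbose=0)
        results[str(params)] = model.evaluate(x_val, y_val, verbose=0)

        # Tear everything down before the next candidate.
        del model
        K.clear_session()
        gc.collect()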

@ms1design

ms1design commented Jul 30, 2018

Check the bottom of #2102 and #9379.

@phobrain

phobrain commented Oct 4, 2018

Try clear_session() before del model, the hypothesis being that clear_session() still needs the model object.

@flyingalexis

The memory is not released immediately after calling K.clear_session(). I guess one needs to run load_model afterwards?

You saved my day!
But by the way, how do you know that?

@danFromTelAviv
Contributor

Calling

    K.clear_session()
    del model

after each training cycle worked for me.

@xl233

xl233 commented Apr 10, 2019

I'm not sure why, but it works for me when I add all three of these lines:

import gc
from keras import backend as K

K.clear_session()
gc.collect()
del model

@zachmayer
Contributor

Is there a way to tell which tensorflow variables are associated with a specific model?

I'd like to only clear the variables associated with a specific model, and then delete it.

@maks-ym

maks-ym commented Sep 1, 2019

@zachmayer

Is there a way to tell which tensorflow variables are associated with a specific model?

I'd like to only clear the variables associated with a specific model, and then delete it.

Just a guess: what about creating a separate session for each model and then using the methods mentioned above to clear that specific session? Presumably, that would clean up only the variables you want.
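
A minimal sketch of that idea using the TF1-style graph/session API (build_model and the data variables are hypothetical): each model lives in its own graph and session, so closing the session should release only that model's variables.

    import tensorflow as tf
    from keras import backend as K

    graph = tf.Graph()
    with graph.as_default():
        session = tf.Session(graph=graph)
        K.set_session(session)   # make Keras build this model in its own session
        model = build_model()    # hypothetical model-building helper
        model.fit(x_train, y_train)

    session.close()              # releases this model's resources only
    del model, session, graph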

@fPkX6F1nGTX

This strategy is currently working for me on a Lambda machine; I am using only one GPU at a time:

https://stackoverflow.com/a/61252435/12763497

It is a community wiki answer, so please feel free to edit it if you find anything else out. I do have memory leaks, but they are eliminated by using calls to multiprocessing.Process with a timeout feature (which does require estimating the maximum duration of each model training/validation session).
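
A rough sketch of that multiprocessing pattern (run_training, configs, and MAX_SECONDS are hypothetical): running each training in a child process means the GPU memory is returned to the OS when the process exits, and the timeout guards against runs that hang or leak.

    import multiprocessing as mp

    def train_and_evaluate(config, queue):
        # Import Keras inside the child so TensorFlow claims the GPU in this
        # process only; all GPU memory is freed when the process exits.
        import keras  # noqa: F401
        score = run_training(config)  # hypothetical training/validation helper
        queue.put(score)

    if __name__ == "__main__":
        for config in configs:  # hypothetical iterable of hyperparameter configs
            queue = mp.Queue()
            p = mp.Process(target=train_and_evaluate, args=(config, queue))
            p.start()
            p.join(timeout=MAX_SECONDS)  # hypothetical per-run time budget
            if p.is_alive():
                p.terminate()            # kill a run that overshot its budget
            elif not queue.empty():
                print(config, queue.get())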
