How to remove stale models from GPU memory #5345
Hi, I have a similar issue even when just retraining the same model (interchanging model.fit with model.fit_generator). Since I keep all the weights and the batch sizes are equal, there shouldn't be a reason for it to consume more memory.
The only hacky/terrible solution that seems to work involves checkpointing the models you want to keep, clearing all the memory, and reloading the models from disk. Does anyone know a better way? Perhaps you can drop specific graphs or variables?

```python
m = Model(.....)
m.save(tmp_model_name)
del m
K.clear_session()
m = load_model(tmp_model_name)
```
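The save/clear/reload hack above can be wrapped into a small helper. A minimal sketch, assuming the TensorFlow-backed keras package; the helper name, the toy model, and the .h5 checkpoint path are my choices, not from the thread:

```python
import gc
import os
import tempfile

from tensorflow import keras
from tensorflow.keras import layers

def rebuild_from_checkpoint(model, path):
    """Save `model`, wipe the backend session, and reload a fresh copy.

    clear_session() destroys everything else in the session too, so any
    other models you still need must be reloaded the same way. The caller
    must also drop its own reference (e.g. by reassigning the result).
    """
    model.save(path)
    del model                       # drop the local reference
    keras.backend.clear_session()   # drops ALL graph state, not just `model`
    gc.collect()
    return keras.models.load_model(path)

# Hypothetical usage with a toy model
model = keras.Sequential([keras.Input(shape=(8,)), layers.Dense(4), layers.Dense(1)])
path = os.path.join(tempfile.mkdtemp(), "tmp_model.h5")
model = rebuild_from_checkpoint(model, path)
```

Reassigning `model` to the returned copy is what lets the old object actually become garbage; holding both references would defeat the purpose.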
The memory is not released immediately after the call.
Hit the same problem. K.clear_session() doesn't work.
It would be nice to have some guidance on this issue from folks who have dealt with it more elegantly than the save-model/delete-model/clear-session/load-model hack. In my view it is pretty important for reproducibility in Keras.
Does this work?
Same problem here. I'm using an EC2 instance with 100 GB of RAM and a Tesla M60 GPU. I wrote a simple iterative loop to grid-search my hyperparameters and validate them on a small subset of my training data. However, I can't do this due to the constant OOM errors, and quite frankly, manual sequential tuning is getting on my nerves. Is there any concrete way to clear the GPU memory used by Keras in code? I don't want to keep restarting my kernel every time. Just FYI: when I load a model into memory for the first time, Keras utilizes all of the GPU's 8 GB of memory, and it is not freed even after the calls mentioned above. Does anyone have a concrete solution/workaround to this?
Try calling clear_session() before del model; the hypothesis being that the model is still needed by clear_session().
You saved my day!
Doing this after each training cycle worked for me.
I'm not sure why, but this works for me when I add all three of these lines: K.clear_session()
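Combining the suggestions in this thread, a grid-search loop that fully tears down each candidate model before building the next might look like the sketch below. The helper name and the toy configurations are mine; the cleanup calls are the ones discussed above (note the thread disagrees on whether clear_session should come before or after del):

```python
import gc
from tensorflow import keras
from tensorflow.keras import layers

def evaluate_config(units):
    """Build, (pretend to) train, and fully tear down one candidate model."""
    model = keras.Sequential(
        [keras.Input(shape=(8,)), layers.Dense(units), layers.Dense(1)]
    )
    score = model.count_params()   # stand-in for a real train/validate step
    keras.backend.clear_session()  # wipe backend graph state
    del model                      # drop the Python reference
    gc.collect()                   # force collection of the freed objects
    return score

# Hypothetical hyperparameter sweep over two layer widths
results = [evaluate_config(units) for units in (4, 8)]
```

The point is that each iteration leaves no live references behind, so graph state from one configuration cannot accumulate across the sweep.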
Is there a way to tell which TensorFlow variables are associated with a specific model? I'd like to clear only the variables associated with a specific model and then delete it.
Just a guess: what about creating a separate session for each model and then using the methods mentioned above to clear that specific session? Presumably you would then clean only the variables you want.
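One way to realize this separate-session idea in TensorFlow is to build each model's ops in its own tf.Graph with a Session bound to that graph; closing the session releases its resources without touching other graphs. A toy sketch with raw TF ops (my illustration, using the tf.compat.v1 session API; confining a full Keras model to a private graph is harder):

```python
import tensorflow as tf

def run_in_private_graph(value):
    """Evaluate a tiny computation inside its own Graph and Session."""
    # Ops built inside this context never enter the default graph.
    g = tf.Graph()
    with g.as_default():
        x = tf.constant(value)
        y = x * 2.0
    # The session, and the memory it holds, is released when the
    # `with` block exits; other graphs are unaffected.
    with tf.compat.v1.Session(graph=g) as sess:
        return float(sess.run(y))

out = run_in_private_graph(3.0)
```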
This strategy is currently working for me on a Lambda machine; I am using only one GPU at a time: https://stackoverflow.com/a/61252435/12763497 It is a community-wiki answer, so please feel free to edit it if you find out anything else. I do have memory leaks, but they are eliminated by the calls described above.
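A blunter but very reliable variant of the linked approach is process isolation: run each training job in a child process, so the OS reclaims all of its GPU memory unconditionally when the process exits. A minimal sketch using a fresh interpreter per job; the helper name is mine, and the snippet being run is a stand-in for a real training script:

```python
import subprocess
import sys

def run_isolated(training_snippet):
    """Execute `training_snippet` in a fresh Python process.

    Import keras *inside* the snippet so every bit of GPU state lives
    and dies with the child process; nothing can leak into the parent.
    """
    proc = subprocess.run(
        [sys.executable, "-c", training_snippet],
        capture_output=True, text=True, check=True,
    )
    return proc.stdout.strip()

result = run_isolated("print(2 * 21)")  # stand-in for a training run
```

The trade-off is per-job interpreter and library startup cost, plus having to pass results back through stdout, files, or a queue.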
Update (2018/08/01): I would like to provide an update, as I was new to Keras when I posted the question. Currently only the TensorFlow backend supports properly cleaning up the session. This is done by calling K.clear_session(). It removes EVERYTHING from memory (models, optimizer objects, and anything that holds tensors internally), so there is no way to remove a specific stale model. This is not a bug in Keras but a limitation of the backends.

I am working on a pipeline that takes a pre-trained model, splits it, caches the intermediate results of the bottom layers, fine-tunes the top, and merges bottom and top back together. I do two passes of the above using different splits and optimizers. This speeds up training by a factor of 3x compared to freezing the bottom layers.

As you can see, the above process initializes many models which are later discarded. Unfortunately, their weights seem to remain in GPU memory, and after a couple of steps I get an out-of-memory exception: "ResourceExhaustedError (see above for traceback): OOM when allocating tensor".

Is there a way to remove stale models from GPU memory? I tried del and calling Python's gc, but it did not work. Closing/clearing the session is not possible, as this is part of a single pipeline. My backend is TensorFlow.
Here is a simplified pseudo-code of the process: