GpuArrayException: Out of memory #5
Hi! You should not have memory problems training on a Titan X, since we trained our model on GPUs with far less memory. The issue seems to be related to the Python generator that feeds data to the discriminator (not to be confused with the generator network G that generates vessel trees). It seems that the generator d_gen is raising an error and not returning an (x, y) pair.
Please let me know if this solves your problem.
Hi there, thank you for the prompt reply! The answer is yes to all three of your points. The weird thing is that when I tried debugging the code yesterday, I repeatedly called next() on the iterator that supposedly returned the None, and it never did. It is really confusing indeed. If it helps any, I did make a fork of the code and had to change a few things to get it to work (I'm using Keras 2), namely the change I made in … Side detail: I wasn't sure why the code had … Does anything in that link indicate the issue? If we reach a dead end, maybe I can send you the data files I'm using; they're just some images from DRIVE that I converted to JPG. Edit: the exception seems to get thrown when we call …
Hi, I have seen a similar error, but only on the TensorFlow backend, not on Theano. The workaround I found was to make the first call to next(d_gen) on the main thread; this is documented and implemented in … I think the iterator does not return None but raises an exception. If you want to check, you could add a try/except inside the iterator's code and print the exception; that might give a bit more information. Let me know if this helps. You say that calling next(d_gen) returns a valid (x, y) pair, but fit_generator gives an error? If that is true, the fastest workaround I can see is to use the train_on_batch method instead of fit_generator. But it should work with fit_generator... About the … you could send me the files, but the code ran on my computer before, so I guess I will not be able to reproduce your error :(
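Both suggestions above — a try/except inside the iterator to surface the hidden exception, and priming the generator with next() on the main thread — can be sketched as follows. This is a toy illustration, not code from the repository; d_gen_factory and the tiny data are made up.

```python
import itertools

def d_gen_factory(images, labels):
    """Toy stand-in for the discriminator's data generator.

    The try/except makes any failure inside the generator visible
    instead of being silently swallowed by a fit_generator worker
    thread."""
    for x, y in itertools.cycle(zip(images, labels)):
        try:
            # real preprocessing / a G.predict() call would happen here
            yield x, y
        except Exception as exc:
            print("exception inside d_gen:", repr(exc))
            raise

d_gen = d_gen_factory([[0.0], [1.0]], [0, 1])

# Prime the generator on the main thread before handing it to
# fit_generator, so the first (and heaviest) call does not happen
# inside a worker thread.
first_x, first_y = next(d_gen)
```

Priming consumes one batch, which is harmless here because the generator cycles forever.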
Hi, I just found the bug. The discriminator generator makes a call to the model's … Apparently this can be 'fixed' using the solution in this answer, but I have tried that (and many other things) and it still doesn't work out. Do you have any ideas or workarounds for this? Edit: I just realised you said the code ran on your machine earlier. What version of Keras are you using?
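If the underlying problem is the generator being advanced concurrently from several fit_generator worker threads, one commonly suggested workaround (not from this repository) is to serialise next() behind a lock. A minimal, Keras-free sketch — ThreadSafeIterator is an illustrative name:

```python
import threading

class ThreadSafeIterator:
    """Wrap an iterator so only one thread can advance it at a time."""

    def __init__(self, it):
        self.it = it
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:
            return next(self.it)

    next = __next__  # Python 2 spelling, for Keras-1-era code

def counter(n):
    # toy iterator standing in for d_gen
    for i in range(n):
        yield i

safe = ThreadSafeIterator(counter(3))
```

Wrapping d_gen this way before passing it to fit_generator keeps two threads from entering the generator body at once; it would not help if the crash comes from calling the model inside the generator while a fit is running on that same model.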
Hi, since the first call to the iterator is not on the …, I do not have a very clean solution. A workaround would be to, instead of using …:
It will probably be a bit slower, but it should solve the problem. If you ever find a better solution to this problem, a pull request is most welcome. In the meantime I will give it some thought and try to look into this issue.
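The train_on_batch alternative discussed in this thread can be sketched as a plain loop that keeps the generator on the main thread. StubModel stands in for the compiled Keras discriminator, and every name below is illustrative rather than taken from the repository:

```python
class StubModel:
    """Stand-in for a compiled Keras model; train_on_batch on a real
    model would run one gradient step and return the batch loss."""

    def train_on_batch(self, x, y):
        # here: mean absolute error, just to return a plausible number
        return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

def pairs():
    # endless toy (x, y) batches, standing in for d_gen
    while True:
        yield [0.0, 1.0], [0.0, 0.0]

model = StubModel()
d_gen = pairs()

losses = []
for step in range(5):          # i.e. steps_per_epoch
    x, y = next(d_gen)         # generator only ever runs on this thread
    losses.append(model.train_on_batch(x, y))
```

This gives up fit_generator's callbacks and progress bar in exchange for full control over where the generator is called.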
Are you using a current version of Keras, or an older version? With regard to the StackOverflow answer... I tried this, but to no avail, e.g. calling …
I used an older version of Keras, the latest version prior to Keras 2.0; I cannot confirm the exact version at the moment. Calling … It is strange that it does not work on the latest version of Keras. Could you please try using the …
Of course, this will not have the …
Yes, I can confirm that … Edit: what is also weird is that if I keep …
Yes, you cannot use callbacks, and I guess it will be slightly slower. The … Ohhhh... then maybe the model is just compiling; the first time it takes a lot of time. Can you check whether your processor is being used while it is hanging? There should be some CUDA processes running. If that's the case, just leave it running for a while.
No, I don't think it's compiling anything (I checked for any other running processes). I also had this weird issue where it was also hanging when I was just trying to use the …
The … What happens is that, when you use the exact same architecture, the first compilation takes a lot of time, but while it is compiling Theano saves some files that make it faster the second time you run it. That might explain why it hanged when you tried it the first time and worked the second time: part of the work was already done. Either way, you said you tried calling … It might be some issue with the current version of Keras. Maybe if you downgrade to a previous stable version it will work?
I did manage to use the …
And after running the predict it still crashes? That is strange. I tested on Keras 1.2.2. It might be an issue with the new version of Keras; since I still have not fully migrated to Keras 2, I have no idea how to solve that, I am sorry. Let me know if everything works on the version I mentioned.
Hi,
Thank you for putting some code up for your paper; I enjoyed reading it.
I've been trying for a while to get your code to run and I'm getting this error here:
Normally I'd be inclined to think this is actually a memory error (and maybe it is? After all, that's exactly what it says), but when I replace the generator with a dummy network (i.e., make the output of the generator simply its input), I still get this error. I'm on a Titan X, which has 12 GB of memory, and the batch size is 1, so I don't see how this could be possible. Did you guys train on a Titan X, or on something with a bit more GPU memory?
I am using the latest and greatest Theano + Keras and this is on the libgpuarray backend, which Theano recently switched to.
Any thoughts?