
GpuArrayException: Out of memory #5

Open
christopher-beckham opened this issue May 19, 2017 · 13 comments
@christopher-beckham

Hi,

Thank you for putting some code up for your paper, I enjoyed reading it.

I've been trying for a while to get your code to run and I'm getting this error here:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/cbeckham/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/cbeckham/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/cbeckham/.local/lib/python2.7/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
  File "train.py", line 78, in discriminator_generator
    b_fake = atob.predict(a_fake.astype("float32"))
  File "/home/cbeckham/.local/lib/python2.7/site-packages/keras/engine/training.py", line 1590, in predict
    batch_size=batch_size, verbose=verbose)
  File "/home/cbeckham/.local/lib/python2.7/site-packages/keras/engine/training.py", line 1217, in _predict_loop
    batch_outs = f(ins_batch)
  File "/home/cbeckham/.local/lib/python2.7/site-packages/keras/backend/theano_backend.py", line 1196, in __call__
    return self.function(*inputs)
  File "/home/cbeckham/.local/lib/python2.7/site-packages/theano/compile/function_module.py", line 898, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/home/cbeckham/.local/lib/python2.7/site-packages/theano/gof/link.py", line 325, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/home/cbeckham/.local/lib/python2.7/site-packages/theano/compile/function_module.py", line 884, in __call__
    self.fn() if output_subset is None else\
  File "pygpu/gpuarray.pyx", line 683, in pygpu.gpuarray.pygpu_copy (pygpu/gpuarray.c:9990)
  File "pygpu/gpuarray.pyx", line 396, in pygpu.gpuarray.array_copy (pygpu/gpuarray.c:7083)
GpuArrayException: Out of memory
Apply node that caused the error: GpuContiguous(InplaceGpuDimShuffle{3,2,0,1}.0)
Toposort index: 122
Inputs types: [GpuArrayType<None>(float32, 4D)]
Inputs shapes: [(128, 3, 2, 2)]
Inputs strides: [(4, 512, 3072, 1536)]
Inputs values: ['not shown']
Outputs clients: [[Shape(GpuContiguous.0), GpuDnnConvGradI{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty{dtype='float32', context_name=None}.0, GpuDnnConvDesc{border_mode='valid', subsample=(2, 2), dilation=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})]]


HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.

Traceback (most recent call last):
  File "train.py", line 369, in <module>
    train(models, it_train, it_val, params)
  File "train.py", line 247, in train
    train_iteration(models, generators, losses, params)
  File "train.py", line 199, in train_iteration
    dhist = train_discriminator(d, d_gen, samples_per_batch=params.samples_per_batch)
  File "train.py", line 101, in train_discriminator
    return d.fit_generator(it, steps_per_epoch=samples_per_batch, epochs=1, verbose=False)
  File "/home/cbeckham/.local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/home/cbeckham/.local/lib/python2.7/site-packages/keras/engine/training.py", line 1873, in fit_generator
    str(generator_output))
ValueError: output of generator should be a tuple `(x, y, sample_weight)` or `(x, y)`. Found: None

Normally I'd be inclined to think this is actually a memory error (and maybe it is?? After all, that's exactly what it says), but when I replace the generator with a dummy network (i.e., make the output of the generator simply the input), I still get this error. I'm on a Titan X which has 12GB of memory, and the batch size is 1, so I don't see how this could be possible. Did you guys train on a Titan X or something with a bit more GPU memory?

I am using the latest and greatest Theano + Keras and this is on the libgpuarray backend, which Theano recently switched to.

Any thoughts?

@costapt
Owner

costapt commented May 19, 2017

Hi!

You should not have memory problems training on a Titan X since we trained our model on GPUs with far less memory.

The issue seems to be related to the Python generator that feeds data to the discriminator (not to be confused with the generator network G that generates vessel trees). It seems that the generator d_gen is raising an error and therefore not returning an (x, y) pair.
The most likely causes are the following (a quick sanity check is sketched after the list):

  1. Are you sure you have the data in the correct folder structure? You should have a directory with 'train' and 'val' folders, and each of these should contain 'A' and 'B' folders with the images.
  2. Are your images in one of these formats: png/jpg/bmp?
  3. Do the images in A have the same name and extension as the images in B?
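
A quick way to rule out all three points is a small check along these lines (the 'data' root directory is an assumption; adjust the path to wherever the dataset actually lives):

import os

root = 'data'  # hypothetical dataset root; adjust to your setup
valid_ext = ('.png', '.jpg', '.bmp')

for split in ('train', 'val'):
    a_files = sorted(os.listdir(os.path.join(root, split, 'A')))
    b_files = sorted(os.listdir(os.path.join(root, split, 'B')))
    # The same file names (and extensions) must be present in A and B.
    assert a_files == b_files, 'A/B mismatch in %s' % split
    # Only png/jpg/bmp images should be present.
    assert all(f.lower().endswith(valid_ext) for f in a_files), split
    print('%s: %d image pairs look OK' % (split, len(a_files)))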

Please let me know if this solves your problem.

@christopher-beckham
Author

christopher-beckham commented May 19, 2017

Hi there,

Thank you for the prompt reply! The answer is yes to all three of your points.

The weird thing is, when I tried debugging the code yesterday, I was repeatedly calling next() on the iterator that supposedly returned the None, and it never gave me one. It is really confusing indeed.

If it helps, I did make a fork of the code and had to change a few things to get it to work (I'm using Keras 2), namely a change in data.py and a small argument-parsing bug:

christopher-beckham@65d7ac8

Side detail: I wasn't sure why the code passed samples_per_batch*2 to fit_generator instead of just samples_per_batch, so I had changed it in my fork. I've now changed it back to samples_per_batch*2 (to keep it consistent with your branch) and I still get the same error.

Does anything in that link indicate the issue?

If we reach a dead end, maybe I can send you the data files I'm using. They're just some images from DRIVE that I converted to jpg.

Edit: the exception seems to get thrown when we call b_fake = atob.predict(a_fake). However, if I write some test code to call predict myself (after getting a_fake from the iterator), it works just fine. Is there something funky going on with fit_generator? I realised it's multi-threaded -- could that be the issue in some way, shape or form? I don't see how the iterator could ever return None; if I run it myself, it yields samples indefinitely (as you would expect).

@costapt
Owner

costapt commented May 21, 2017

Hi,

I have seen a similar error, but only on the TensorFlow backend, not on Theano. The workaround I found was to make the first call to next(d_gen) on the main thread. This is documented and implemented in the generators_creation method in train.py.
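
A minimal sketch of that idea, using the d and d_gen names from this thread (the fit_generator arguments mirror the call in train.py and may differ from the actual code):

# First call on the main thread: any compilation or predict-function creation
# happens here instead of inside fit_generator's background thread.
x, y = next(d_gen)

# Only afterwards is d_gen handed to fit_generator (arguments as in train.py).
d.fit_generator(d_gen, steps_per_epoch=samples_per_batch * 2, epochs=1, verbose=False)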

I think that the iterator does not return None but rather raises an exception. To check, you could add a try/except inside the iterator's code and print the exception; that might give a little more information. Let me know if this helps.
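
For example, a hedged sketch of such a wrapper (debug_gen is a hypothetical helper, not part of the repo), which could be passed to fit_generator in place of d_gen:

import traceback

def debug_gen(gen):
    # Wrap a data generator so that exceptions raised inside fit_generator's
    # background thread get printed instead of silently turning into None.
    while True:
        try:
            yield next(gen)
        except Exception:
            traceback.print_exc()
            raise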

You say that calling next(d_gen) returns a valid (x, y) pair, but fit_generator gives an error? If so, the quickest workaround I can see is to use the train_on_batch method instead of fit_generator. But it should work with fit_generator...

About the samples_per_batch*2: the discriminator iterator yields a batch with samples_per_batch real pairs and samples_per_batch fake pairs. That is why I use samples_per_batch*2, but my guess is that it will not make a big difference.

You could send me the files but the code ran on my computer before, so I guess I will not be able to reproduce your error :(

@christopher-beckham
Author

christopher-beckham commented May 21, 2017

Hi,

I just found the bug. The discriminator generator makes a call to the model's predict function, but in the new version of Keras fit_generator works asynchronously. You can see a similar issue here:

keras-team/keras#3084

Apparently this can be 'fixed' using the solution suggested there, but I have tried that (and many other things) and it still doesn't work out.
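
For reference, the fix discussed in that issue amounts to something like the following, run on the main thread before training starts (whether this is sufficient for this repo is an assumption):

# Force Keras to build atob's predict function on the main thread, before
# fit_generator spawns its data-loading thread (which calls atob.predict).
atob._make_predict_function()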

Do you have any ideas or workarounds for this?

Edit: I just realised you said the code ran on your machine earlier. What version of Keras are you using?

@costapt
Owner

costapt commented May 22, 2017

Hi,

Since the first call to the iterator is not made inside the fit_generator method, the model._make_predict_function() call (from the StackOverflow answer) should cause no problem. I do not know why this is happening.

I do not have a very clean solution. A workaround would be to use the following instead of fit_generator:

x, y = next(d_gen)
model.train_on_batch(x, y)

It will probably be a bit slower but should solve the problem.

If you ever find a better solution to this problem, a pull request is most welcome. In the meantime I will give it some thought and try to find out more about this issue.

@christopher-beckham
Author

Are you using a current version of Keras, or an older version? pip freeze | grep Keras gives me Keras==2.0.4.

With regard to the StackOverflow answer... I tried this but to no avail, e.g. calling _make_predict_function() on all the models before training, and even in the discriminator_generator method before the while loop starts. Did this work for you?

@costapt
Owner

costapt commented May 22, 2017

I used an older version of Keras, the latest release prior to Keras 2.0; I cannot confirm the exact version at the moment.

Calling predict for the first time will eventually call _make_predict_function(). That problem only happened to me on the TensorFlow backend, but I thought it was solved by doing just that: calling predict for the first time on the main thread (line 170 of train.py).

It is strange that it does not work on the latest version of Keras. Could you please try using the train_on_batch method to see if it works? You just need to change the train_discriminator function to the following:

def train_discriminator(d, it):
    x, y = next(it)
    return d.train_on_batch(x, y)

Of course, this will not go through the samples_per_batch*2 batches per call, but if it works we can confirm that the problem lies within the fit_generator method.
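
If keeping the original number of batches per call matters, a hedged variant of the same idea (averaging the returned losses is an assumption, not what fit_generator reports):

import numpy as np

def train_discriminator(d, it, samples_per_batch=1):
    # Pull the same samples_per_batch*2 batches fit_generator would have used
    # and train on each of them with train_on_batch.
    losses = []
    for _ in range(samples_per_batch * 2):
        x, y = next(it)
        losses.append(d.train_on_batch(x, y))
    return np.mean(losses, axis=0)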

@christopher-beckham
Author

christopher-beckham commented May 22, 2017

Yes, I can confirm that train_on_batch works, but then the same error comes up again when we evaluate the model with evaluate_generator, so that would have to change as well. Of course, the worst-case solution would be to use train_on_batch everywhere, but then it looks like we lose the ability to use callbacks (unless there is a nice way to incorporate them -- I am new to Keras).
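
For the evaluation side, the analogous workaround would look roughly like this (evaluate_discriminator and val_steps are hypothetical names, not from the repo):

import numpy as np

def evaluate_discriminator(d, it, val_steps):
    # test_on_batch computes loss/metrics without updating the weights,
    # mirroring the train_on_batch workaround above.
    scores = []
    for _ in range(val_steps):
        x, y = next(it)
        scores.append(d.test_on_batch(x, y))
    return np.mean(scores, axis=0)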

Edit: what is also weird is that if I keep next(d_gen) in the code, it just hangs forever... I actually have to comment that line out in order to even get to the memory error.

@costapt
Owner

costapt commented May 22, 2017

Yes, with train_on_batch you cannot use callbacks, and I guess it will be slightly slower: fit_generator loads the next batch of data while the model is running.

Ohhhh... Then maybe the model is just compiling. The first time it takes a long time. Can you check whether your processor is being used while it is hanging? There should be some CUDA processes running. If that's the case, just leave it running for a while.

@christopher-beckham
Author

No, I don't think it's compiling anything (I checked for any other running processes). I also had this weird issue where it was hanging when I was just trying to use the discriminator_iterator (independently of trying to train the net). Then, when I tried an hour later, it worked. It's like there's some sort of weird non-determinism going on somewhere...

@costapt
Owner

costapt commented May 22, 2017

The discriminator_iterator probably hangs on the atob.predict call. You can easily check that with pdb or with some prints.

What happens is that, when you use the exact same architecture, the first compilation takes a long time, but while compiling Theano caches some files that make it faster the next time you run it. That might explain why it hung the first time you tried and worked the second time: part of the work was already done.

Either way, you said you tried calling atob.predict before running the fit_generator? If you tried that, then your problem must be something else that I am not aware of.

It might be some issue with the current version of Keras. Maybe if you downgrade to a previous stable version it will work?

@christopher-beckham
Author

I did manage to use model.predict successfully outside of fit_generator, with no issues. If you can, do let me know the version of Keras you have! If you have the exact commit too, that would be even better.

@costapt
Owner

costapt commented May 24, 2017

And after running the predict it still crashes? That is strange.

I tested on Keras version 1.2.2.

That might be an issue with the new version of Keras. Since I have not yet fully migrated to Keras 2, I have no idea how to solve it. I am sorry.

Let me know if everything works on the version I mentioned.
