
Not enough GPU ram ... make a "slow" option #31

Closed
VolkerH opened this issue Apr 5, 2019 · 12 comments


@VolkerH (Owner) commented Apr 5, 2019

Both gputools (which uses reikna for its FFT) and flowdec run out of GPU memory when trying to process a stack of size (151, 800, 600).

Depending on what exactly I am trying to do, the error message in tensorflow shows up either when initializing the batched cuFFT plan or later when allocating space for a tensor.

One of the error messages showed it was trying to allocate a tensor of size (256, 1024, 1024). When I crop the volume by 23 pixels in Z (leaving 128 z slices), everything works fine. When I crop only 22 pixels, it fails.

flowdec rounds the sizes up to the next size at which the fastest FFT can be performed, and this rounding appears to be very generous. It would be nice to be able to trade some speed for the ability to process such volumes. I should look into adding an option to round up to the next size at which an FFT can be performed at all, even if it is not optimal in terms of speed.
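For a sense of the numbers, here is a quick back-of-the-envelope sketch (illustration only, not flowdec code; it assumes scipy is available and uses scipy.fftpack.next_fast_len, which returns the next 5-smooth size, a family of sizes cuFFT also handles efficiently):

```python
import numpy as np
from scipy.fftpack import next_fast_len

shape = (151, 800, 600)
pow2 = tuple(1 << int(np.ceil(np.log2(s))) for s in shape)  # next power of 2 per axis
smooth = tuple(next_fast_len(s) for s in shape)             # gentler 5-smooth rounding

for label, dims in [("original", shape), ("power-of-2", pow2), ("5-smooth", smooth)]:
    gib = np.prod(dims) * 8 / 2**30  # 8 bytes per voxel for one complex64 copy
    print(f"{label:10s} {dims} -> {gib:.2f} GiB per complex64 copy")
```

The power-of-2 rounding yields (256, 1024, 1024), matching the tensor size in the error message above, at roughly 3.5 times the memory of the 5-smooth sizes (160, 800, 600).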

@VolkerH (Owner) commented Apr 5, 2019

I also tested whether real_mode_fft requires less video RAM, but this is not the case.

Coincidentally, I just noticed this:
tlambert03/pycudadecon#7

So I guess I don't need to test with pycudadecon either.

@VolkerH (Owner) commented Apr 5, 2019

This is the code I need to look at: def optimize_dims(dims, mode) in
https://github.com/hammerlab/flowdec/blob/master/python/flowdec/fft_utils_tf.py

For arrays that are too large, one could fall back to Bluestein's algorithm (which handles arbitrary transform sizes) instead of padding.

@VolkerH (Owner) commented Apr 5, 2019

Turns out that was easy. The mode that gets passed into optimize_dims can be set when initializing the deconvolver class in flowdec via the named argument pad_mode. In the specific example mentioned above, this allows the deconvolution to run on the GPU. I haven't benchmarked it, but it is not much slower. However, I appear to get more artefacts at the image boundary.

Will add this as a command-line option.
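For reference, a minimal sketch of this workaround using flowdec's Python API (img and psf stand in for real numpy arrays, and the iteration count is arbitrary):

```python
from flowdec import data as fd_data
from flowdec import restoration as fd_restoration

# pad_mode='none' disables the generous power-of-2 padding, trading FFT
# speed (and possibly boundary quality) for a much smaller VRAM footprint
algo = fd_restoration.RichardsonLucyDeconvolver(3, pad_mode='none').initialize()
result = algo.run(fd_data.Acquisition(data=img, kernel=psf), niter=30).data
```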

@dmilkie commented Apr 5, 2019 via email

@VolkerH (Owner) commented Apr 6, 2019

@dmilkie
Thanks for the comment, and I agree: that is exactly the type of artefact I was seeing when disabling padding.

I will need to check a few more things in the flowdec source code. I believe that if I provide images that already have optimal dimensions for the FFT, it will not perform any padding by default (but I might be wrong). So far, most of the stacks I put through it had dimensions that were rounded up and padded very generously (I am not sure about the default fill strategy either; I will have to check the source code), so the output always looked quite artefact-free.

@dmilkie commented Apr 6, 2019 via email

@eric-czech commented:

FWIW, the default behavior is to pad to the next highest power of 2 using the "reflect" mode shown here or to do nothing if the length along any one axis is already a power of 2. I think you probably saw this but you can also pass pad_mode='none' to just turn off any of the automatic padding/cropping if you wanted to take care of that externally.

@VolkerH (Owner) commented Apr 7, 2019

> Nice. I believe there is a "mirror" pad option in there. Regarding VRAM bloat: another thing to check is the precision of the calculations. If they are double, that's too much. A smaller gain would be to check the FFTs: if they are complex-to-complex (instead of complex-to-real and real-to-complex), that'll cost you. Also, I just implemented a shared workArea in the cudaDecon code so the two FFT plans use the same space (since they execute serially), which saved roughly a single copy of the data (out of the 7 or so it needs). Small gain, but it's something.

I am impressed by the level of such nice tweaks implemented in cudaDecon. I had already looked at whether it is possible to use lower precision (such as float16/complex32) to save VRAM. However, tensorflow does not provide that level of control: it is either complex64 or complex128 (see https://www.tensorflow.org/api_docs/python/tf/signal/fft3d). Also, tensorflow doesn't give much control over FFT plan creation and storage.

I was wondering whether tensorflow could automatically fall back to storing arrays in CPU memory when there is not enough VRAM. From what I understand, this happens for operations that are not supported on the GPU if the option allow_soft_placement is given, but apparently tensorflow does not do this based on memory considerations.
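For reference, this is the TensorFlow 1.x option in question; a sketch only, since as noted it only relocates unsupported ops and does not spill tensors out of VRAM:

```python
import tensorflow as tf

# allow_soft_placement moves ops without a GPU kernel to the CPU;
# it does not move tensors to main memory when VRAM runs out
config = tf.ConfigProto(allow_soft_placement=True)
sess = tf.Session(config=config)
```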

> FWIW, the default behavior is to pad to the next highest power of 2 using the "reflect" mode shown here or to do nothing if the length along any one axis is already a power of 2. I think you probably saw this but you can also pass pad_mode='none' to just turn off any of the automatic padding/cropping if you wanted to take care of that externally.

Yes, I did see that, and setting pad_mode='none' allowed me to deconvolve a volume that otherwise gave error messages due to lack of VRAM. To ensure optimum quality whenever possible, one would probably need a hierarchy like in this pseudocode:

newsize = origsize + psf_width // 2        # grow input by at least half the PSF width per axis
optimalnewsize = optimize_dims(newsize)    # round up to the nearest speed-optimal FFT size
try:
    allocate_graph(optimalnewsize)
except VramError:
    try:
        allocate_graph(newsize)
    except VramError:
        try:
            allocate_graph(origsize)       # warn about potential artefacts
        except VramError:
            allocate_graph_on_cpu(origsize)  # fall back to main memory rather than VRAM

This gets complicated rather quickly, but I think I will implement the first part of it, the PSF-based padding, to ensure consistent quality regardless of the input size (see the sketch below).
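A minimal sketch of that first step (a hypothetical helper, not flowdec code): reflect-pad each axis by half the PSF extent so that the wrap-around of the circular convolution lands in the padding rather than in real data.

```python
import numpy as np

def pad_for_psf(img, psf):
    """Reflect-pad img by half the PSF extent on each side."""
    pad = [(s // 2, s // 2) for s in psf.shape]
    return np.pad(img, pad, mode='reflect')
```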

@dmilkie commented Apr 8, 2019 via email

@VolkerH (Owner) commented Apr 8, 2019

The Biggs thesis from the University of Auckland (https://researchspace.auckland.ac.nz/handle/2292/1760). I have skimmed it but haven't found the time to read it properly.

However, now that your cudaDeconv code has been open-sourced, it becomes a bit more difficult to justify putting the effort in (it is fun, though, and the lessons learnt will definitely be useful in other projects). I think I need to lock down the feature set I want for now and create a usable and easily installable version.

Back to VRAM utilization:
The approach I am using, deconvolving the raw data directly by resampling and skewing the PSF rather than deskewing the volume first, can potentially save considerable memory (see the sketch below). I could also try that with cudaDeconv.
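A rough illustration of the skewed-PSF idea (hypothetical code: the axis order z, y, x is an assumption, and the shear factor would in practice be derived from the stage step, pixel size and light-sheet angle):

```python
import numpy as np
from scipy import ndimage

def skew_psf(psf, shear):
    """Shear the PSF into the raw (stage-scan) geometry so the raw,
    un-deskewed stack can be deconvolved directly."""
    # affine_transform maps output to input coordinates:
    # x_in = x_out + shear * z_out, with axes ordered (z, y, x)
    matrix = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [shear, 0.0, 1.0]])
    return ndimage.affine_transform(psf, matrix, order=1)
```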

@dmilkie commented Apr 8, 2019 via email

@VolkerH (Owner) commented Apr 10, 2019

Thanks for the link to the paper and the suggestions.

I will close this issue, as I've addressed some of the points discussed here via this branch: #33

VolkerH closed this as completed Apr 10, 2019