ispc backend failing on large ranges #2109

Closed · spakin opened this issue Feb 16, 2024 · 9 comments

spakin commented Feb 16, 2024

Here's a reproducer, badness.fut:

def main (n : i64) : []i64 =
  filter (\x -> x >= 100 && x <= 110) (0i64..<(1i64<<n))

This works as expected with the C backend:

$ futhark c badness.fut && echo 30 | ./badness
[100i64, 101i64, 102i64, 103i64, 104i64, 105i64, 106i64, 107i64, 108i64, 109i64, 110i64]
$ futhark c badness.fut && echo 31 | ./badness
[100i64, 101i64, 102i64, 103i64, 104i64, 105i64, 106i64, 107i64, 108i64, 109i64, 110i64]
$ futhark c badness.fut && echo 32 | ./badness
[100i64, 101i64, 102i64, 103i64, 104i64, 105i64, 106i64, 107i64, 108i64, 109i64, 110i64]

The ISPC backend is not so happy, though:

$ futhark ispc badness.fut && echo 30 | ./badness
[100i64, 101i64, 102i64, 103i64, 104i64, 105i64, 106i64, 107i64, 108i64, 109i64, 110i64]
$ futhark ispc badness.fut && echo 31 | ./badness
empty([0]i64)
$ futhark ispc badness.fut && echo 32 | ./badness
Segmentation fault (core dumped)

Here's what I'm running:

$ uname -a
Linux zapp 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ futhark -V
Futhark 0.25.13
git: e65db61dc0e9293708b91ddd07b21d2ef9f31518
Copyright (C) DIKU, University of Copenhagen, released under the ISC license.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
$ ispc -v
Intel(r) Implicit SPMD Program Compiler (Intel(r) ISPC), 1.22.0 (build commit bd2c42d42e0cc3da @ 20231116, LLVM 16.0.6)

spakin commented Feb 17, 2024

It looks like the CUDA backend has similar but not identical problems to the ISPC backend:

$ env CFLAGS="-I$CUDA_HOME/include -O -L$CUDA_HOME/lib64" futhark-aarch64 cuda badness.fut && echo 30 | ./badness 
Warning: device compute capability is 9.0, but newest supported by Futhark is 8.7.
[100i64, 101i64, 102i64, 103i64, 104i64, 105i64, 106i64, 107i64, 108i64, 109i64, 110i64]
$ env CFLAGS="-I$CUDA_HOME/include -O -L$CUDA_HOME/lib64" futhark-aarch64 cuda badness.fut && echo 31 | ./badness 
Warning: device compute capability is 9.0, but newest supported by Futhark is 8.7.
[100i64, 101i64, 102i64, 103i64, 104i64, 105i64, 106i64, 107i64, 108i64, 109i64, 110i64]
$ env CFLAGS="-I$CUDA_HOME/include -O -L$CUDA_HOME/lib64" futhark-aarch64 cuda badness.fut && echo 32 | ./badness 
Warning: device compute capability is 9.0, but newest supported by Futhark is 8.7.
./badness: badness.c:7153: CUDA call
  cuCtxSynchronize()
failed with error code 700 (an illegal memory access was encountered)

This is on early Grace Hopper hardware (ARM CPU + H100 GPU).

athas self-assigned this Feb 17, 2024

athas commented Feb 17, 2024

Well! That is not great. And on the AMD RX7900 at home, running your program is a very efficient way to shut down the graphical display.

Fortunately, I don't think this is difficult to fix. It's probably an artifact of the old 32-bit size handling, which still lurks in some calculations in the code generator, but the foundations are 64-bit clean. I will take a look.

athas commented Feb 17, 2024

The OpenCL backend works, so it's likely a problem in the single-pass scan, which is used for the CUDA and HIP backends.

The multicore backend also works, so the ISPC error is due to something ISPC-specific.

athas commented Feb 17, 2024

For the GPU backends, this might actually just be an OOM error. Filtering is surprisingly memory-expensive: with the GPU backends, n=29 requires 12GiB of memory, so presumably n=30 would require 24GiB, n=31 48GiB, and n=32 96GiB, the last of which is beyond even what an H100 possesses. It's just a coincidence that this is somewhat close to the 32-bit barrier.

The reason for the memory usage is as follows, for an array of m = 2^n elements:

  • Creating the array to be filtered: 8m bytes.
  • Creating a true/false mask of booleans: m bytes.
  • Offset array: 8m bytes.
  • Creating an output array into which the filter result will be put: 8m bytes.

The mask array is fused with the scan producing the offset array, and so doesn't take any memory. I suppose there is no reason for the output array to be so large, however; I think it is only because our filter is actually implemented as partition.
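
As a back-of-the-envelope check of those figures (my own arithmetic, not taken from the compiler; it assumes the mask is fused away and the three remaining 8-bytes-per-element arrays dominate):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
  for (int64_t n = 29; n <= 32; n++) {
    int64_t m = INT64_C(1) << n;   /* elements in 0..<(1<<n) */
    int64_t bytes = 3 * 8 * m;     /* input + offset array + output array */
    printf("n = %" PRId64 ": ~%" PRId64 " GiB\n", n, bytes >> 30);
  }
  return 0;
}

This prints roughly 12, 24, 48 and 96 GiB for n = 29..32, matching the numbers above.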

We should handle GPU OOM better. This has been on my list for a while, but the GPU APIs make it surprisingly difficult to do robustly.

The ISPC error is probably a real 32-bit issue, however, and I think I remember why: ISPC is very slow when asked to do 64-bit index arithmetic, so we reasoned that nobody would want to use it with such large arrays anyway.
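
To make that failure mode concrete, here is a small sketch (my illustration, not the compiler's actual generated code) of what happens when a size that only fits in an i64 passes through 32-bit arithmetic:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
  for (int64_t n = 30; n <= 32; n++) {
    int64_t m = INT64_C(1) << n;   /* true length of 0..<(1<<n) */
    int32_t m32 = (int32_t)m;      /* what a 32-bit size computation sees */
    printf("n = %" PRId64 ": true length %" PRId64 ", as i32: %" PRId32 "\n",
           n, m, m32);
  }
  return 0;
}

On a typical two's-complement target, n=31 truncates to -2147483648 and n=32 to 0; sizes or offsets wrapped like this can easily produce an empty result or out-of-bounds indexing.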

athas added a commit that referenced this issue Feb 17, 2024

spakin commented Feb 19, 2024

I didn't realize that a literal range would still require full-size array construction. Thanks for explaining where all the memory is going.

I just checked, and my Grace Hopper node has 80GiB of GPU memory and 512GiB of CPU memory. I assume this implies that Futhark is not using Unified Memory, since otherwise the 96GiB case could presumably have spilled over into CPU memory. That would be a nice option for the CUDA backend if you're looking for more work. 😀 I've heard that the penalty for GPUs accessing CPU memory is a lot lower on Grace Hopper than on previous systems, but I haven't yet tested that myself.

athas commented Feb 19, 2024

I am in fact considering just enabling unified memory unconditionally on the CUDA backend. I did some experiments recently, and it doesn't seem to cause any overhead in the cases where you stay within GPU memory anyway (and it lets you finish execution when you don't).

FluxusMagna commented

@athas Would you add unified memory to the HIP backend as well then?

athas commented Feb 19, 2024

If it performs similarly, I don't see why not. But it would also be easy to make a configuration option indicating which kind of memory you prefer, as the allocation APIs are pretty much identical either way.

In the longer term, this would also make some operations more efficient (such as directly indexing Futhark arrays from the host), but just making oversize allocations possible would be an easy start.
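
For concreteness, a minimal sketch of the two CUDA driver API allocations being compared (just an illustration of the "pretty much identical" point, not how the Futhark runtime actually structures its allocator):

#include <cuda.h>

/* Managed memory takes one extra flag and can migrate between host and
   device, so allocations larger than device memory remain possible
   (if potentially slow); plain device memory fails outright at that point. */
CUresult alloc_buffer(CUdeviceptr *p, size_t bytes, int use_unified) {
  if (use_unified)
    return cuMemAllocManaged(p, bytes, CU_MEM_ATTACH_GLOBAL);
  return cuMemAlloc(p, bytes);
}

The HIP side is analogous (hipMalloc versus hipMallocManaged), which is presumably why a single option selecting the preferred kind of memory would be easy to support.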

athas commented Jun 6, 2024

Unified memory is now enabled by default on CUDA (if the device claims to support it). Not on HIP, because it seems to cause slowdowns sometimes.

athas closed this as completed Jun 6, 2024