Polynomial arithmetic implemented in CUDA #2

Open
andrewmilson opened this issue Nov 13, 2022 · 4 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)


@andrewmilson (Owner) commented Nov 13, 2022

This likely requires a fair bit of work.

Since Metal and CUDA are both C++ based, it would be great if field implementations (and other functionality) could be shared between the CUDA and Metal code. One obstacle is the address space keywords that Metal uses, e.g. "constant", which currently appear in the field implementations.

Also, the first gpu-poly version was written for my M1 Mac. The M1 has a unified memory architecture, so memory doesn't have to be moved to and from the GPU. That will no longer be the case once CUDA support is added, so it might be worth creating a new Buffer type that abstracts CPU<->GPU memory movement away from the rest of the library.
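Very rough sketch of the kind of interface I'm imagining (names are placeholders; this only shows what the CUDA half might look like, and a Metal/M1 backend could implement the same interface with no copies at all):

```cpp
// Placeholder sketch of a Buffer that owns the CPU<->GPU movement so callers
// never issue copies themselves. Error handling and move semantics omitted.
#include <cstddef>
#include <cuda_runtime.h>

template <typename T>
class Buffer {
public:
    explicit Buffer(std::size_t len) : len_(len) {
        cudaMalloc(reinterpret_cast<void**>(&device_), len_ * sizeof(T));
    }
    ~Buffer() { cudaFree(device_); }

    // Make host data visible to the GPU before launching kernels.
    void upload(const T* host) {
        cudaMemcpy(device_, host, len_ * sizeof(T), cudaMemcpyHostToDevice);
    }

    // Bring results back to the host once kernels have finished.
    void download(T* host) const {
        cudaMemcpy(host, device_, len_ * sizeof(T), cudaMemcpyDeviceToHost);
    }

    T* device_ptr() const { return device_; }

private:
    T* device_ = nullptr;
    std::size_t len_ = 0;
};
```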

andrewmilson added the enhancement and help wanted labels on Nov 13, 2022
@powergun commented

Wondering if you have considered OpenCL for wider hardware compatibility?

> Since Metal and CUDA are both C++ based it would be great if field implementations (and other functionality) could be shared between the CUDA and Metal code

Based on experience, this can be achieved with C++ templating. It would dramatically increase code complexity and decrease readability, but it requires the least amount of tooling changes (any C++98 compiler could do it).
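A toy example of the kind of thing I mean (not your actual field code, and it ignores the Metal address-space issue for now):

```cpp
// Toy example only: one templated field type with the modulus as a template
// parameter, so the same header can be compiled as host C++, CUDA, or Metal code.
#include <cstdint>

template <uint64_t MODULUS>
struct Fp {
    uint64_t value;

    Fp operator+(const Fp& rhs) const {
        // add, then subtract the modulus at most once; the `sum < value` test
        // catches wrap-around for moduli close to 2^64
        uint64_t sum = value + rhs.value;
        if (sum < value || sum >= MODULUS) {
            sum -= MODULUS;
        }
        return Fp{sum};
    }
};

// e.g. the 2^64 - 2^32 + 1 field would then just be an alias
using Felt64 = Fp<18446744069414584321ULL>;
```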

> M1 has a unified memory architecture so the memory doesn't have to be moved to and from the GPU. This will no longer be the case if CUDA support is added.

Not a direct answer, but I think you could borrow the idea behind the C++17 polymorphic allocator:
basically, it lets people write different memory allocator implementations that conform to the new C++17 allocator interface; each allocator could be backed by the heap, an arena, or a device-specific address space, and the STL data structures using the allocator interface won't notice the difference - they just work.
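For illustration (the CUDA-backed memory resource below is just a sketch; the std::pmr parts are standard C++17):

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>
#include <cuda_runtime.h>

// Sketch of a memory_resource backed by CUDA managed memory; any pmr container
// can use it without knowing where its bytes actually live.
class ManagedMemoryResource : public std::pmr::memory_resource {
    void* do_allocate(std::size_t bytes, std::size_t /*alignment*/) override {
        // cudaMallocManaged returns memory aligned generously enough for typical use
        void* p = nullptr;
        cudaMallocManaged(&p, bytes);
        return p;
    }
    void do_deallocate(void* p, std::size_t, std::size_t) override {
        cudaFree(p);
    }
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }
};

int main() {
    ManagedMemoryResource res;
    // Looks like an ordinary vector, but its storage is GPU-accessible.
    std::pmr::vector<unsigned long long> coeffs(1 << 20, 0, &res);
}
```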

@andrewmilson (Owner, Author) commented

Thanks Wei.

From what I've read online, OpenCL is ~40% slower than Metal and ~30% slower than CUDA. I'd be really curious what kind of performance an OpenCL implementation of miniSTARK would get, though. Also, there are some things I'd be excited to try out with the CUDA implementation that I don't think are possible with OpenCL or Metal. For instance, the Decoupled Lookback algorithm could be used in a few places to get some significant performance gains.

C++ templating is currently used for the GPU kernels (here, for instance). The issue was figuring out a nice way to define a type without using keywords specific to the Metal Shading Language. For instance, https://github.com/andrewmilson/ministark/blob/main/gpu-poly/src/metal/felt_u64.h.metal#L61 uses the "constant" address space keyword (the code won't compile if it's simply removed), which isn't standard C++ as far as I'm aware. I guess every use of N could be replaced with the literal 18446744069414584321, but I'd really like to keep things readable. If "constant" can be removed somehow, then I think the types can be shared between the Metal and CUDA code.
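Maybe something like this macro shim could work (completely untested sketch, just to show the shape of it):

```cpp
// Untested sketch: hide the Metal address-space keyword behind a macro so the
// same field header can be included from Metal and from CUDA / host C++.
#ifdef __METAL_VERSION__
#define FIELD_CONSTANT constant   // Metal: keep the address-space qualifier
#else
#define FIELD_CONSTANT            // CUDA / host C++: expands to nothing
#endif

class Felt {
public:
    // 2^64 - 2^32 + 1, still named N so the kernels stay readable
    // (unsigned long is 64 bits on the targets we care about)
    FIELD_CONSTANT static constexpr unsigned long N = 18446744069414584321UL;
    // ...
};
```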

The allocator stuff sounds cool, but the Metal Shading Language only supports C++14 :(

@cheinger commented

Hey Andrew!

CUDA has unified memory, which abstracts away CPU<->GPU transfers. Memory pages are migrated implicitly by the CUDA driver according to where the memory is accessed. Here are some resources you might find helpful:

Fault-driven migration comes with additional overhead: the GPU's MMU stalls until the required memory range is resident on the GPU. To avoid this overhead, you can distribute memory between the CPU and GPU, with mappings from GPU to CPU, so that accesses are fault-free. Look at the cudaMemPrefetchAsync and cudaMemAdvise APIs.
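Rough example of the flow (error checking omitted):

```cpp
#include <cstddef>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaGetDevice(&device);

    const size_t n = size_t(1) << 24;
    unsigned long long* coeffs = nullptr;
    // one allocation visible from both CPU and GPU; pages migrate on demand
    cudaMallocManaged(reinterpret_cast<void**>(&coeffs), n * sizeof(*coeffs));

    for (size_t i = 0; i < n; ++i) coeffs[i] = i;  // touched on the CPU first

    // tell the driver where the data should preferably live...
    cudaMemAdvise(coeffs, n * sizeof(*coeffs), cudaMemAdviseSetPreferredLocation, device);
    // ...and migrate it up front instead of paying a fault per page
    cudaMemPrefetchAsync(coeffs, n * sizeof(*coeffs), device);

    // launch kernels that read/write coeffs here, then:
    cudaDeviceSynchronize();
    cudaFree(coeffs);
}
```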

Hope this helps!

@andrewmilson (Owner, Author) commented

Hahah the CUDA legend himself! This is super helpful. Thanks mate
