fix: simplify chunk size and window size calculation of multiexp #25

Merged (1 commit merged into master on Jul 4, 2022)

Conversation

@vmx (Contributor) commented on Jul 1, 2022

The chunk size calculation used to be based on the number of CUDA cores. Instead, a fixed number of threads is now used, which is then split evenly into blocks for optimal performance.

Benchmarks were run on a Quadro RTX 6000, a GeForce RTX 2080 Ti, and a GeForce RTX 3090 to make sure there is no big regression on either CUDA or OpenCL. In a few cases performance is slightly worse, but for large numbers of terms things got significantly faster, which is what this library is optimized for.

Below are the numbers from those runs. For each graphics card the multiexp benchmark was run twice, and for each size the better (lower) number of the two runs was used. The table compares the runtime prior to this commit (old) with the runtime with this commit applied (new). For each size, whichever of old and new is better is shown in bold.


|   GPU     | CUDA/OpenCL | Version |    1024      |     2048     |     4096     |     8192     |     16384    |     32768    |     65536    |     131072   |     262144   |    524288    |   1048576    |   2097152    |   4194304    |    8388608   |  16777216   |  33554432   |  67108864   |  134217728  |  268435456  |
| --------- | ----------- | ------- | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ----------- | ----------- | ----------- | ----------- | ----------- |
| RTX6000   | CUDA        | old     |   15.751ms   |   15.738ms   |   17.320ms   |   19.831ms   |   21.809ms   |   29.624ms   | **40.618ms** | **65.155ms** | **105.18ms** | **185.14ms** | **272.54ms** |   420.49ms   |   790.19ms   |   1.4999s    |   2.9603s   |   5.7531s   |   11.434s   |   22.874s   |   51.522s   |
|           |             | new     | **14.532ms** | **15.147ms** | **16.314ms** | **19.004ms** | **20.955ms** | **28.732ms** |   41.564ms   |   66.707ms   |   109.32ms   |   186.84ms   |   330.00ms   | **410.14ms** | **774.19ms** | **1.3458s**  | **2.4790s** | **4.7752s** | **9.7055s** | **19.031s** | **41.963s** |
|           | OpenCL      | old     |   15.600ms   |   15.877ms   |   17.590ms   |   19.921ms   |   22.696ms   |   32.388ms   |   42.934ms   |   71.716ms   |   112.10ms   |   191.42ms   |   341.43ms   |   580.50ms   |    1.0662s   |   2.0566s    |   4.1424s   |   8.4653s   |   16.047s   |   35.467s   |   72.832s   |
|           |             | new     | **14.544ms** | **15.009ms** | **16.649ms** | **18.487ms** | **20.833ms** | **27.015ms** | **38.307ms** | **62.836ms** | **102.64ms** | **176.04ms** | **309.27ms** | **526.33ms** | **984.03ms** | **1.8042s**  | **3.4835s** | **6.8203s** | **14.003s** | **28.640s** | **58.824s** |
| RTX2080Ti | CUDA        | old     |   10.994ms   |   11.244ms   |   14.179ms   |   15.410ms   |   17.584ms   |   24.994ms   | **31.996ms** |   51.912ms   |   93.754ms   | **151.65ms** | **221.34ms** |   364.90ms   |   677.16ms   |   1.3217s    |   2.5868s   |   5.2162s   |   10.402s   |   20.883s   |   41.937s   |
|           |             | new     | **10.344ms** | **9.6598ms** | **14.143ms** | **15.274ms** | **17.198ms** | **21.817ms** |   35.415ms   | **50.726ms** | **88.575ms** |   153.75ms   |   271.72ms   | **319.03ms** | **590.31ms** | **1.0330s**  | **1.9006s** | **3.8953s** | **7.9497s** | **16.002s** | **32.062s** |
|           | OpenCL      | old     |   11.447ms   | **11.552ms** | **14.123ms** |   16.393ms   |   21.599ms   |   27.510ms   |   37.208ms   |   60.860ms   |   105.99ms   |   170.86ms   |   302.54ms   |   523.51ms   |   962.37ms   |   1.9242s    |   3.8334s   |   7.7212s   |   15.376s   |   30.795s   |   61.678s   |
|           |             | new     | **11.140ms** |   11.987ms   |   14.837ms   | **13.714ms** | **16.898ms** | **24.077ms** | **32.700ms** | **50.925ms** | **87.819ms** | **153.94ms** | **267.78ms** | **467.17ms** | **856.95ms** | **1.6093s**  | **3.1487s** | **6.3742s** | **12.894s** | **25.888s** | **52.105s** |
| RTX3090   | CUDA        | old     |   28.924ms   |   28.606ms   |   29.551ms   | **20.608ms** |   33.097ms   |   36.271ms   |   36.353ms   |   43.155ms   |   67.801ms   |   86.059ms   |   150.68ms   |   340.78ms   |   534.71ms   |   985.17ms   |   1.7543s   |   3.5924s   |   7.2819s   |   14.658s   |   29.133s   |
|           |             | new     | **15.513ms** | **16.934ms** | **19.606ms** |   23.755ms   | **24.186ms** | **28.759ms** | **32.147ms** | **35.125ms** | **50.428ms** | **76.278ms** | **122.85ms** | **206.41ms** | **529.83ms** | **953.46ms** | **1.7170s** | **3.2375s** | **6.6036s** | **13.378s** | **26.999s** |
|           | OpenCL      | old     | **18.875ms** | **22.025ms** |   26.669ms   | **25.151ms** |   29.823ms   | **29.561ms** | **34.674ms** |   43.384ms   |   67.859ms   |   100.48ms   |   174.86ms   |   313.63ms   |   489.99ms   |   899.34ms   |   1.5981s   |   3.2942s   |   6.6854s   |   13.473s   |   26.754s   |
|           |             | new     |   21.406ms   |   22.300ms   | **24.353ms** |   30.037ms   | **28.156ms** |   32.799ms   |   39.520ms   | **41.796ms** | **57.424ms** | **89.439ms** | **147.61ms** | **258.56ms** | **489.23ms** | **865.07ms** | **1.5351s** | **2.8767s** | **5.8899s** | **11.910s** | **24.049s** |
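The change described above can be sketched roughly as follows. This is an illustrative Rust sketch, not this library's actual code: the `NUM_THREADS` constant, the function names, and the ln-based window-size heuristic (a common choice in Pippenger-style multiexp implementations) are all assumptions for illustration.

```rust
// Illustrative sketch only; names and constants are hypothetical,
// not the actual API of this library.

/// A fixed thread budget per kernel launch, instead of deriving the
/// chunk count from the number of CUDA cores of the device.
const NUM_THREADS: usize = 1024;

/// Split `num_terms` multiexp terms into chunks, rounding up so that
/// every term is covered by some thread.
fn chunk_size(num_terms: usize) -> usize {
    (num_terms + NUM_THREADS - 1) / NUM_THREADS
}

/// A common Pippenger-style heuristic: the window size (in bits)
/// grows with the natural logarithm of the number of terms.
fn window_size(num_terms: usize) -> usize {
    (num_terms as f64).ln().ceil() as usize
}

fn main() {
    // 2^20 terms split across 1024 threads gives chunks of 1024 terms.
    assert_eq!(chunk_size(1 << 20), 1024);
    // Small inputs still get at least one term per chunk.
    assert_eq!(chunk_size(1000), 1);
    // ln(2^20) = 20 * ln(2) ≈ 13.86, rounded up to 14.
    assert_eq!(window_size(1 << 20), 14);
    println!("ok");
}
```

The point of a fixed thread budget is that the chunk count no longer varies across devices, so the work always splits into the same, evenly sized blocks.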
@vmx vmx requested a review from cryptonemo July 1, 2022 11:43
@bchyl commented on Jul 4, 2022

Hi @vmx, @cryptonemo,

We have recently been trying to port from BLS12-381 to BN254, based on the EF's pairing repo. The following is the performance data collected for multiexp on an Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz (80+ cores, 35 GB memory) and 4 T4 GPUs:

```
Testing FFT for 524288 elements...
GPU took 35ms.
CPU (64 cores) took 149ms.
Speedup: x4.257143
============================
Testing FFT for 1048576 elements...
GPU took 67ms.
Testing FFT3 for 1048576 elements...
GPU took 61ms.
CPU (64 cores) took 263ms.
Speedup: x3.925373
============================
Testing FFT for 2097152 elements...
GPU took 102ms.
CPU (64 cores) took 752ms.
Speedup: 12.327868
============================
CPU (64 cores) took 428ms.
Speedup: x4.1960783
============================
test fft::tests::fft ... ok
Testing FFT3 for 2097152 elements...
GPU took 98ms.
CPU (64 cores) took 1275ms.
Speedup: 13.010204
============================
Testing FFT3 for 4194304 elements...
GPU took 182ms.
CPU (64 cores) took 2540ms.
Speedup: 13.956044
============================
Testing FFT3 for 8388608 elements...
GPU took 332ms.
CPU (64 cores) took 5209ms.
Speedup: 15.689759
============================
Testing FFT3 for 16777216 elements...
GPU took 764ms.
CPU (64 cores) took 11522ms.
Speedup: 15.081152
============================
```
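For reference, the `Speedup` figures in the output above are simply the CPU time divided by the GPU time. A quick check against the last FFT3 run (16777216 elements):

```rust
fn main() {
    // Values taken from the FFT3 run for 16777216 elements above.
    let cpu_ms = 11522.0_f64;
    let gpu_ms = 764.0_f64;
    let speedup = cpu_ms / gpu_ms;
    // Matches the reported "Speedup: 15.081152".
    assert!((speedup - 15.081152).abs() < 1e-5);
    println!("{:.6}", speedup);
}
```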

Can we try to perfect the raw patch and merge it into one of our branches?

Thanks very much.

@vmx (Contributor, Author) commented on Jul 4, 2022

> Can we try to perfect the raw patch and merge it into one of our branches?

This PR is the first step to simplify things a bit. Over the next week there will be more PRs that will change the library quite a bit. The outcome sounds like it would be a good match for you, though. I'm currently working on making the library independent of `Engine` and using the base and scalar fields directly instead. This might make it possible to use this library for your use case without any changes.

Hence I propose that you wait for those changes, and then we can see how we can make your use case work.

@vmx vmx merged commit c68c369 into master Jul 4, 2022
@vmx vmx deleted the multiexp-chunks branch July 4, 2022 09:15
@bchyl commented on Jul 5, 2022

> > Can we try to perfect the raw patch and merge it into one of our branches?
>
> This PR is the first step to simplify things a bit. Over the next week there will be more PRs that will change the library quite a bit. The outcome sounds like it would be a good match for you, though. I'm currently working on making the library independent of `Engine` and using the base and scalar fields directly instead. This might make it possible to use this library for your use case without any changes.
>
> Hence I propose that you wait for those changes, and then we can see how we can make your use case work.

Nice, looking forward to using the library directly. By the way, we have slightly updated the performance data results above. Thank you very much.
