fix: simplify chunk size and window size calculation of multiexp #25

Merged (1 commit merged into master on Jul 4, 2022)

Conversation

@vmx (Contributor) commented on Jul 1, 2022

The chunk size calculation used to be based on the number of CUDA cores. Instead, a fixed number of threads is now used, which is then split evenly into blocks for optimal performance.

Benchmarks were run on a Quadro RTX 6000, a GeForce RTX 2080 Ti, and a GeForce RTX 3090 to make sure there is no big regression on either CUDA or OpenCL. In a few cases performance is slightly worse, but for large numbers of terms things got significantly faster, which is what this library is optimized for.

Below are the numbers from those runs. For each graphics card the multiexp benchmark was run twice, and for each size the better (lower) number of the two runs was used. The table compares the runtime prior to this commit (old) with the runtime with this commit applied (new). For each size, whichever of old and new is better is shown in bold.


|   GPU     | CUDA/OpenCL | Version |    1024      |     2048     |     4096     |     8192     |     16384    |     32768    |     65536    |     131072   |     262144   |    524288    |   1048576    |   2097152    |   4194304    |    8388608   |  16777216   |  33554432   |  67108864   |  134217728  |  268435456  |
| --------- | ----------- | ------- | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ----------- | ----------- | ----------- | ----------- | ----------- |
| RTX6000   | CUDA        | old     |   15.751ms   |   15.738ms   |   17.320ms   |   19.831ms   |   21.809ms   |   29.624ms   | **40.618ms** | **65.155ms** | **105.18ms** | **185.14ms** | **272.54ms** |   420.49ms   |   790.19ms   |   1.4999s    |   2.9603s   |   5.7531s   |   11.434s   |   22.874s   |   51.522s   |
|           |             | new     | **14.532ms** | **15.147ms** | **16.314ms** | **19.004ms** | **20.955ms** | **28.732ms** |   41.564ms   |   66.707ms   |   109.32ms   |   186.84ms   |   330.00ms   | **410.14ms** | **774.19ms** | **1.3458s**  | **2.4790s** | **4.7752s** | **9.7055s** | **19.031s** | **41.963s** |
|           | OpenCL      | old     |   15.600ms   |   15.877ms   |   17.590ms   |   19.921ms   |   22.696ms   |   32.388ms   |   42.934ms   |   71.716ms   |   112.10ms   |   191.42ms   |   341.43ms   |   580.50ms   |    1.0662s   |   2.0566s    |   4.1424s   |   8.4653s   |   16.047s   |   35.467s   |   72.832s   |
|           |             | new     | **14.544ms** | **15.009ms** | **16.649ms** | **18.487ms** | **20.833ms** | **27.015ms** | **38.307ms** | **62.836ms** | **102.64ms** | **176.04ms** | **309.27ms** | **526.33ms** | **984.03ms** | **1.8042s**  | **3.4835s** | **6.8203s** | **14.003s** | **28.640s** | **58.824s** |
| RTX2080Ti | CUDA        | old     |   10.994ms   |   11.244ms   |   14.179ms   |   15.410ms   |   17.584ms   |   24.994ms   | **31.996ms** |   51.912ms   |   93.754ms   | **151.65ms** | **221.34ms** |   364.90ms   |   677.16ms   |   1.3217s    |   2.5868s   |   5.2162s   |   10.402s   |   20.883s   |   41.937s   |
|           |             | new     | **10.344ms** | **9.6598ms** | **14.143ms** | **15.274ms** | **17.198ms** | **21.817ms** |   35.415ms   | **50.726ms** | **88.575ms** |   153.75ms   |   271.72ms   | **319.03ms** | **590.31ms** | **1.0330s**  | **1.9006s** | **3.8953s** | **7.9497s** | **16.002s** | **32.062s** |
|           | OpenCL      | old     |   11.447ms   | **11.552ms** | **14.123ms** |   16.393ms   |   21.599ms   |   27.510ms   |   37.208ms   |   60.860ms   |   105.99ms   |   170.86ms   |   302.54ms   |   523.51ms   |   962.37ms   |   1.9242s    |   3.8334s   |   7.7212s   |   15.376s   |   30.795s   |   61.678s   |
|           |             | new     | **11.140ms** |   11.987ms   |   14.837ms   | **13.714ms** | **16.898ms** | **24.077ms** | **32.700ms** | **50.925ms** | **87.819ms** | **153.94ms** | **267.78ms** | **467.17ms** | **856.95ms** | **1.6093s**  | **3.1487s** | **6.3742s** | **12.894s** | **25.888s** | **52.105s** |
| RTX3090   | CUDA        | old     |   28.924ms   |   28.606ms   |   29.551ms   | **20.608ms** |   33.097ms   |   36.271ms   |   36.353ms   |   43.155ms   |   67.801ms   |   86.059ms   |   150.68ms   |   340.78ms   |   534.71ms   |   985.17ms   |   1.7543s   |   3.5924s   |   7.2819s   |   14.658s   |   29.133s   |
|           |             | new     | **15.513ms** | **16.934ms** | **19.606ms** |   23.755ms   | **24.186ms** | **28.759ms** | **32.147ms** | **35.125ms** | **50.428ms** | **76.278ms** | **122.85ms** | **206.41ms** | **529.83ms** | **953.46ms** | **1.7170s** | **3.2375s** | **6.6036s** | **13.378s** | **26.999s** |
|           | OpenCL      | old     | **18.875ms** | **22.025ms** |   26.669ms   | **25.151ms** |   29.823ms   | **29.561ms** | **34.674ms** |   43.384ms   |   67.859ms   |   100.48ms   |   174.86ms   |   313.63ms   |   489.99ms   |   899.34ms   |   1.5981s   |   3.2942s   |   6.6854s   |   13.473s   |   26.754s   |
|           |             | new     |   21.406ms   |   22.300ms   | **24.353ms** |   30.037ms   | **28.156ms** |   32.799ms   |   39.520ms   | **41.796ms** | **57.424ms** | **89.439ms** | **147.61ms** | **258.56ms** | **489.23ms** | **865.07ms** | **1.5351s** | **2.8767s** | **5.8899s** | **11.910s** | **24.049s** |
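The change described above can be sketched roughly as follows. This is an illustrative Rust sketch, not this library's actual code: the `NUM_THREADS` constant, the function names, and the ln-based window-size heuristic (a common choice in Pippenger-style multiexp implementations) are all assumptions for illustration.

```rust
// Illustrative sketch only; names and constants are hypothetical,
// not the actual API of this library.

/// A fixed thread budget per kernel launch, instead of deriving the
/// chunk count from the number of CUDA cores of the device.
const NUM_THREADS: usize = 1024;

/// Split `num_terms` multiexp terms into chunks, rounding up so that
/// every term is covered by some thread.
fn chunk_size(num_terms: usize) -> usize {
    (num_terms + NUM_THREADS - 1) / NUM_THREADS
}

/// A common Pippenger-style heuristic: the window size (in bits)
/// grows with the natural logarithm of the number of terms.
fn window_size(num_terms: usize) -> usize {
    (num_terms as f64).ln().ceil() as usize
}

fn main() {
    // 2^20 terms split across 1024 threads gives chunks of 1024 terms.
    assert_eq!(chunk_size(1 << 20), 1024);
    // Small inputs still get at least one term per chunk.
    assert_eq!(chunk_size(1000), 1);
    // ln(2^20) = 20 * ln(2) ≈ 13.86, rounded up to 14.
    assert_eq!(window_size(1 << 20), 14);
    println!("ok");
}
```

The point of a fixed thread budget is that the chunk count no longer varies across devices, so the work always splits into the same, evenly sized blocks.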
@vmx vmx requested a review from cryptonemo July 1, 2022 11:43
@bchyl commented on Jul 4, 2022

Hi @vmx, @cryptonemo,

We have recently been trying to port from BLS12-381 to BN254, based on the EF's pairing repo. The following is the performance data collected for multiexp on an Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz (80+ cores, 35 GB memory) and 4 T4 GPUs:

```
Testing FFT for 524288 elements...
GPU took 35ms.
CPU (64 cores) took 149ms.
Speedup: x4.257143
============================
Testing FFT for 1048576 elements...
GPU took 67ms.
Testing FFT3 for 1048576 elements...
GPU took 61ms.
CPU (64 cores) took 263ms.
Speedup: x3.925373
============================
Testing FFT for 2097152 elements...
GPU took 102ms.
CPU (64 cores) took 752ms.
Speedup: 12.327868
============================
CPU (64 cores) took 428ms.
Speedup: x4.1960783
============================
test fft::tests::fft ... ok
Testing FFT3 for 2097152 elements...
GPU took 98ms.
CPU (64 cores) took 1275ms.
Speedup: 13.010204
============================
Testing FFT3 for 4194304 elements...
GPU took 182ms.
CPU (64 cores) took 2540ms.
Speedup: 13.956044
============================
Testing FFT3 for 8388608 elements...
GPU took 332ms.
CPU (64 cores) took 5209ms.
Speedup: 15.689759
============================
Testing FFT3 for 16777216 elements...
GPU took 764ms.
CPU (64 cores) took 11522ms.
Speedup: 15.081152
============================
```
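For reference, the `Speedup` figures in the output above are simply the CPU time divided by the GPU time. A quick check against the last FFT3 run (16777216 elements):

```rust
fn main() {
    // Values taken from the FFT3 run for 16777216 elements above.
    let cpu_ms = 11522.0_f64;
    let gpu_ms = 764.0_f64;
    let speedup = cpu_ms / gpu_ms;
    // Matches the reported "Speedup: 15.081152".
    assert!((speedup - 15.081152).abs() < 1e-5);
    println!("{:.6}", speedup);
}
```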

Can we try to perfect the raw patch and merge it into one of our branches?

Thanks very much.

@vmx (Contributor, Author) commented on Jul 4, 2022

> Can we try to perfect the raw patch and merge it into one of our branches?

This PR is the first step to simplify things a bit. Over the next week there will be more PRs that will change the library quite a bit. The outcome sounds like it would be a good match for you, though. I'm currently working on making the library independent of `Engine` and using the base and scalar fields directly instead. This might make it possible to use this library for your use case without any changes.

Hence I propose that you wait for those changes, and then we can see how we can make your use case work.

@vmx vmx merged commit c68c369 into master Jul 4, 2022
@vmx vmx deleted the multiexp-chunks branch July 4, 2022 09:15
@bchyl commented on Jul 5, 2022

> > Can we try to perfect the raw patch and merge it into one of our branches?
>
> This PR is the first step to simplify things a bit. Over the next week there will be more PRs that will change the library quite a bit. The outcome sounds like it would be a good match for you, though. I'm currently working on making the library independent of `Engine` and using the base and scalar fields directly instead. This might make it possible to use this library for your use case without any changes.
>
> Hence I propose that you wait for those changes, and then we can see how we can make your use case work.

Nice, looking forward to using the library directly. By the way, we have slightly updated the performance data results above. Thank you very much.
