fix: simplify chunk size and window size calculation of multiexp #25
The chunk size calculation used to be based on the number of CUDA cores. Instead, use a fixed number of threads that is then split evenly into blocks for optimal performance.

Benchmarks have been run on Quadro RTX 6000, GeForce RTX 2080 Ti and GeForce RTX 3090 to make sure there isn't any big regression on CUDA or OpenCL. In some limited cases, performance is slightly worse. For large numbers of terms, however, things got significantly faster, which is what this library is optimized for.

Below are the numbers of those runs. For each graphics card the multiexp benchmark was run twice. For each size the better (lower) number of the two runs was used. The table compares the runtime prior to this commit (old) to the runtime with this commit applied (new). For each size, the better of old and new is shown in bold.

| GPU | CUDA/OpenCL | Version | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 | 65536 | 131072 | 262144 | 524288 | 1048576 | 2097152 | 4194304 | 8388608 | 16777216 | 33554432 | 67108864 | 134217728 | 268435456 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RTX6000 | CUDA | old | 15.751ms | 15.738ms | 17.320ms | 19.831ms | 21.809ms | 29.624ms | **40.618ms** | **65.155ms** | **105.18ms** | **185.14ms** | **272.54ms** | 420.49ms | 790.19ms | 1.4999s | 2.9603s | 5.7531s | 11.434s | 22.874s | 51.522s |
| | | new | **14.532ms** | **15.147ms** | **16.314ms** | **19.004ms** | **20.955ms** | **28.732ms** | 41.564ms | 66.707ms | 109.32ms | 186.84ms | 330.00ms | **410.14ms** | **774.19ms** | **1.3458s** | **2.4790s** | **4.7752s** | **9.7055s** | **19.031s** | **41.963s** |
| | OpenCL | old | 15.600ms | 15.877ms | 17.590ms | 19.921ms | 22.696ms | 32.388ms | 42.934ms | 71.716ms | 112.10ms | 191.42ms | 341.43ms | 580.50ms | 1.0662s | 2.0566s | 4.1424s | 8.4653s | 16.047s | 35.467s | 72.832s |
| | | new | **14.544ms** | **15.009ms** | **16.649ms** | **18.487ms** | **20.833ms** | **27.015ms** | **38.307ms** | **62.836ms** | **102.64ms** | **176.04ms** | **309.27ms** | **526.33ms** | **984.03ms** | **1.8042s** | **3.4835s** | **6.8203s** | **14.003s** | **28.640s** | **58.824s** |
| RTX2080Ti | CUDA | old | 10.994ms | 11.244ms | 14.179ms | 15.410ms | 17.584ms | 24.994ms | **31.996ms** | 51.912ms | 93.754ms | **151.65ms** | **221.34ms** | 364.90ms | 677.16ms | 1.3217s | 2.5868s | 5.2162s | 10.402s | 20.883s | 41.937s |
| | | new | **10.344ms** | **9.6598ms** | **14.143ms** | **15.274ms** | **17.198ms** | **21.817ms** | 35.415ms | **50.726ms** | **88.575ms** | 153.75ms | 271.72ms | **319.03ms** | **590.31ms** | **1.0330s** | **1.9006s** | **3.8953s** | **7.9497s** | **16.002s** | **32.062s** |
| | OpenCL | old | 11.447ms | **11.552ms** | **14.123ms** | 16.393ms | 21.599ms | 27.510ms | 37.208ms | 60.860ms | 105.99ms | 170.86ms | 302.54ms | 523.51ms | 962.37ms | 1.9242s | 3.8334s | 7.7212s | 15.376s | 30.795s | 61.678s |
| | | new | **11.140ms** | 11.987ms | 14.837ms | **13.714ms** | **16.898ms** | **24.077ms** | **32.700ms** | **50.925ms** | **87.819ms** | **153.94ms** | **267.78ms** | **467.17ms** | **856.95ms** | **1.6093s** | **3.1487s** | **6.3742s** | **12.894s** | **25.888s** | **52.105s** |
| RTX3090 | CUDA | old | 28.924ms | 28.606ms | 29.551ms | **20.608ms** | 33.097ms | 36.271ms | 36.353ms | 43.155ms | 67.801ms | 86.059ms | 150.68ms | 340.78ms | 534.71ms | 985.17ms | 1.7543s | 3.5924s | 7.2819s | 14.658s | 29.133s |
| | | new | **15.513ms** | **16.934ms** | **19.606ms** | 23.755ms | **24.186ms** | **28.759ms** | **32.147ms** | **35.125ms** | **50.428ms** | **76.278ms** | **122.85ms** | **206.41ms** | **529.83ms** | **953.46ms** | **1.7170s** | **3.2375s** | **6.6036s** | **13.378s** | **26.999s** |
| | OpenCL | old | **18.875ms** | **22.025ms** | 26.669ms | **25.151ms** | 29.823ms | **29.561ms** | **34.674ms** | 43.384ms | 67.859ms | 100.48ms | 174.86ms | 313.63ms | 489.99ms | 899.34ms | 1.5981s | 3.2942s | 6.6854s | 13.473s | 26.754s |
| | | new | 21.406ms | 22.300ms | **24.353ms** | 30.037ms | **28.156ms** | 32.799ms | 39.520ms | **41.796ms** | **57.424ms** | **89.439ms** | **147.61ms** | **258.56ms** | **489.23ms** | **865.07ms** | **1.5351s** | **2.8767s** | **5.8899s** | **11.910s** | **24.049s** |
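
To illustrate the idea in the description (a fixed thread count divided into blocks, instead of a count derived from the number of CUDA cores), here is a minimal Rust sketch. It is not the code from this PR: the constants `TOTAL_THREADS` and `THREADS_PER_BLOCK`, the `LaunchPlan` struct and the `plan_launch` function are all hypothetical names and values chosen only for illustration, and the window size calculation is left out entirely.

```rust
/// Hypothetical fixed number of GPU threads (an assumption, not the PR's value).
const TOTAL_THREADS: usize = 1 << 15;
/// Hypothetical number of threads per block (an assumption, not the PR's value).
const THREADS_PER_BLOCK: usize = 128;

/// Illustrative launch configuration for a multiexp-style kernel.
#[derive(Debug, PartialEq)]
struct LaunchPlan {
    num_blocks: usize,
    threads_per_block: usize,
    /// Upper bound on how many terms a single thread has to process.
    terms_per_thread: usize,
}

/// Splits a fixed number of threads evenly into blocks and derives the
/// per-thread chunk size from the number of terms alone.
fn plan_launch(num_terms: usize, total_threads: usize, threads_per_block: usize) -> LaunchPlan {
    assert!(threads_per_block > 0, "block size must be non-zero");
    assert!(
        total_threads > 0 && total_threads % threads_per_block == 0,
        "thread count must be a non-zero multiple of the block size"
    );
    let num_blocks = total_threads / threads_per_block;
    // Ceiling division, so that every term is covered by some thread.
    let terms_per_thread = (num_terms + total_threads - 1) / total_threads;
    LaunchPlan {
        num_blocks,
        threads_per_block,
        terms_per_thread,
    }
}

fn main() {
    // The launch geometry is the same for every device; only the chunk size
    // changes with the input size.
    for num_terms in [1usize << 10, 1 << 20, 1 << 28] {
        let plan = plan_launch(num_terms, TOTAL_THREADS, THREADS_PER_BLOCK);
        println!("{num_terms} terms -> {plan:?}");
    }
}
```

In this sketch the launch geometry no longer depends on the device, so the same plan applies to all three cards benchmarked above. Whether that matches the PR's actual constants and its window size handling is not something this page confirms.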
Hi @vmx @cryptonemo, we have recently been trying to port from BLS12-381 to BN254, based on EF's pairing repo.
Can we try to polish the raw patch and merge it into one of our branches? Thanks very much.
This PR is the first step to simplify things a bit. Over the next week, there will be more PRs that will change the library quite a bit. The outcome sounds like it would be a good match for you, though. I am currently working on making it independent of [...]. Hence I propose that you wait for those changes, and then we can see how we can make your use case work.
Nice, looking forward to using the library directly. By the way, we have slightly updated the performance data results above. Thank you very much.