The chunk size calculation used to be based on the number of CUDA cores.
Instead, use a fixed number of threads, which is then split evenly into
blocks for better performance.
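
As a rough illustration of splitting a fixed thread count into blocks, here is a minimal sketch. It is not the code from this commit: `LaunchConfig`, `split_into_blocks` and the `max_block_size` parameter are hypothetical names, and the rule used (largest block size that divides the thread count evenly) is only one possible reading of "split into blocks". It also ignores hardware details such as warp-size alignment.

```rust
/// Hypothetical launch configuration: number of blocks and threads per block.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LaunchConfig {
    pub blocks: usize,
    pub threads_per_block: usize,
}

/// Splits a fixed `total_threads` into equally sized blocks.
///
/// Picks the largest block size that is at most `max_block_size` and divides
/// `total_threads` without remainder, so that every block is completely full.
/// Returns `None` if either argument is zero.
pub fn split_into_blocks(total_threads: usize, max_block_size: usize) -> Option<LaunchConfig> {
    if total_threads == 0 || max_block_size == 0 {
        return None;
    }
    // A block can never be larger than the total number of threads.
    let upper = max_block_size.min(total_threads);
    // Search downwards; 1 always divides, so a result is always found.
    (1..=upper)
        .rev()
        .find(|&size| total_threads % size == 0)
        .map(|threads_per_block| LaunchConfig {
            blocks: total_threads / threads_per_block,
            threads_per_block,
        })
}
```

With these assumptions, `split_into_blocks(1024, 256)` gives 4 blocks of 256 threads, while a thread count that is not a multiple of the limit, such as `split_into_blocks(1000, 256)`, falls back to 4 blocks of 250 threads.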

Benchmarks were run on a Quadro RTX 6000, a GeForce RTX 2080 Ti and a
GeForce RTX 3090 to make sure there is no major regression on CUDA or
OpenCL. In some cases at sizes up to 1048576 terms, performance is
worse. For large numbers of terms, which is what this library is
optimized for, things got significantly faster.

Below are the numbers of those runs. For each graphics card the
multiexp benchmark was run twice, and for each size the better (lower)
of the two results was used. The table compares the runtime prior to
this commit (old) with the runtime with this commit applied (new). For
each size, the better of old and new is shown in bold.

| GPU | CUDA/OpenCL | Version | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 | 65536 | 131072 | 262144 | 524288 | 1048576 | 2097152 | 4194304 | 8388608 | 16777216 | 33554432 | 67108864 | 134217728 | 268435456 |
| --------- | ----------- | ------- | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ----------- | ----------- | ----------- | ----------- | ----------- |
| RTX6000 | CUDA | old | 15.751ms | 15.738ms | 17.320ms | 19.831ms | 21.809ms | 29.624ms | **40.618ms** | **65.155ms** | **105.18ms** | **185.14ms** | **272.54ms** | 420.49ms | 790.19ms | 1.4999s | 2.9603s | 5.7531s | 11.434s | 22.874s | 51.522s |
| | | new | **14.532ms** | **15.147ms** | **16.314ms** | **19.004ms** | **20.955ms** | **28.732ms** | 41.564ms | 66.707ms | 109.32ms | 186.84ms | 330.00ms | **410.14ms** | **774.19ms** | **1.3458s** | **2.4790s** | **4.7752s** | **9.7055s** | **19.031s** | **41.963s** |
| | OpenCL | old | 15.600ms | 15.877ms | 17.590ms | 19.921ms | 22.696ms | 32.388ms | 42.934ms | 71.716ms | 112.10ms | 191.42ms | 341.43ms | 580.50ms | 1.0662s | 2.0566s | 4.1424s | 8.4653s | 16.047s | 35.467s | 72.832s |
| | | new | **14.544ms** | **15.009ms** | **16.649ms** | **18.487ms** | **20.833ms** | **27.015ms** | **38.307ms** | **62.836ms** | **102.64ms** | **176.04ms** | **309.27ms** | **526.33ms** | **984.03ms** | **1.8042s** | **3.4835s** | **6.8203s** | **14.003s** | **28.640s** | **58.824s** |
| RTX2080Ti | CUDA | old | 10.994ms | 11.244ms | 14.179ms | 15.410ms | 17.584ms | 24.994ms | **31.996ms** | 51.912ms | 93.754ms | **151.65ms** | **221.34ms** | 364.90ms | 677.16ms | 1.3217s | 2.5868s | 5.2162s | 10.402s | 20.883s | 41.937s |
| | | new | **10.344ms** | **9.6598ms** | **14.143ms** | **15.274ms** | **17.198ms** | **21.817ms** | 35.415ms | **50.726ms** | **88.575ms** | 153.75ms | 271.72ms | **319.03ms** | **590.31ms** | **1.0330s** | **1.9006s** | **3.8953s** | **7.9497s** | **16.002s** | **32.062s** |
| | OpenCL | old | 11.447ms | **11.552ms** | **14.123ms** | 16.393ms | 21.599ms | 27.510ms | 37.208ms | 60.860ms | 105.99ms | 170.86ms | 302.54ms | 523.51ms | 962.37ms | 1.9242s | 3.8334s | 7.7212s | 15.376s | 30.795s | 61.678s |
| | | new | **11.140ms** | 11.987ms | 14.837ms | **13.714ms** | **16.898ms** | **24.077ms** | **32.700ms** | **50.925ms** | **87.819ms** | **153.94ms** | **267.78ms** | **467.17ms** | **856.95ms** | **1.6093s** | **3.1487s** | **6.3742s** | **12.894s** | **25.888s** | **52.105s** |
| RTX3090 | CUDA | old | 28.924ms | 28.606ms | 29.551ms | **20.608ms** | 33.097ms | 36.271ms | 36.353ms | 43.155ms | 67.801ms | 86.059ms | 150.68ms | 340.78ms | 534.71ms | 985.17ms | 1.7543s | 3.5924s | 7.2819s | 14.658s | 29.133s |
| | | new | **15.513ms** | **16.934ms** | **19.606ms** | 23.755ms | **24.186ms** | **28.759ms** | **32.147ms** | **35.125ms** | **50.428ms** | **76.278ms** | **122.85ms** | **206.41ms** | **529.83ms** | **953.46ms** | **1.7170s** | **3.2375s** | **6.6036s** | **13.378s** | **26.999s** |
| | OpenCL | old | **18.875ms** | **22.025ms** | 26.669ms | **25.151ms** | 29.823ms | **29.561ms** | **34.674ms** | 43.384ms | 67.859ms | 100.48ms | 174.86ms | 313.63ms | 489.99ms | 899.34ms | 1.5981s | 3.2942s | 6.6854s | 13.473s | 26.754s |
| | | new | 21.406ms | 22.300ms | **24.353ms** | 30.037ms | **28.156ms** | 32.799ms | 39.520ms | **41.796ms** | **57.424ms** | **89.439ms** | **147.61ms** | **258.56ms** | **489.23ms** | **865.07ms** | **1.5351s** | **2.8767s** | **5.8899s** | **11.910s** | **24.049s** |
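
The comparison method described above the table (best of the repeated runs, then old versus new) can be sketched as follows. This is only an illustration and not the benchmark harness that was actually used; `best_of`, `compare` and `Winner` are hypothetical names, and runtimes are assumed to be given in seconds.

```rust
/// Which version had the lower best-of-runs runtime.
#[derive(Debug, PartialEq, Eq)]
pub enum Winner {
    Old,
    New,
    Tie,
}

/// Returns the best (lowest) runtime of the given runs, or `None` if empty.
pub fn best_of(runs: &[f64]) -> Option<f64> {
    runs.iter().copied().reduce(f64::min)
}

/// Compares the best old runtime with the best new runtime.
///
/// Returns the winner and the ratio `old / new`; a ratio above 1.0 means
/// the new version is faster by that factor.
pub fn compare(old_runs: &[f64], new_runs: &[f64]) -> Option<(Winner, f64)> {
    let old = best_of(old_runs)?;
    let new = best_of(new_runs)?;
    let winner = if new < old {
        Winner::New
    } else if old < new {
        Winner::Old
    } else {
        Winner::Tie
    };
    Some((winner, old / new))
}
```

Applied to the RTX6000 CUDA row at 268435456 terms (51.522s old, 41.963s new), the ratio `old / new` comes out at roughly 1.23, so the new version would be the one printed in bold.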