fix: simplify chunk size and window size calculation of multiexp #25

Merged 1 commit on Jul 4, 2022

Commits on Jul 1, 2022

  1. fix: simplify chunk size and window size calculation of multiexp

    The chunk size calculation used to be based on the number of CUDA cores.
    Instead, use a fixed number of threads that is then split into blocks
    for optimal performance.
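
As a rough illustration of the idea described above, the following Rust sketch splits a fixed thread budget into blocks and derives how many terms each thread would handle. The constants `TOTAL_THREADS` and `THREADS_PER_BLOCK` and both function names are hypothetical and chosen only for this example; the values and formulas actually used by the commit are not stated in this message.

```rust
// Hypothetical values for illustration only; not taken from the commit.
pub const TOTAL_THREADS: usize = 1 << 17;
pub const THREADS_PER_BLOCK: usize = 128;

/// Number of blocks needed to cover `total_threads` when each block
/// holds `threads_per_block` threads (rounded up).
pub fn num_blocks(total_threads: usize, threads_per_block: usize) -> usize {
    assert!(threads_per_block > 0, "block size must be non-zero");
    (total_threads + threads_per_block - 1) / threads_per_block
}

/// Number of terms each thread handles when `num_terms` terms are
/// spread over a fixed budget of `total_threads` threads (rounded up).
/// Note that this no longer depends on the number of CUDA cores.
pub fn terms_per_thread(num_terms: usize, total_threads: usize) -> usize {
    assert!(total_threads > 0, "thread budget must be non-zero");
    (num_terms + total_threads - 1) / total_threads
}
```

With these assumed constants, `num_blocks(TOTAL_THREADS, THREADS_PER_BLOCK)` gives the launch grid size, and `terms_per_thread(n, TOTAL_THREADS)` gives the per-thread workload for `n` terms. Because the thread budget is fixed, the result is the same on every card.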
    
    Benchmarks have been run on Quadro RTX 6000, GeForce RTX 2080 Ti and
    GeForce RTX 3090 to make sure there isn't any big regression on CUDA
    or OpenCL. In some limited cases, performance is worse. For large
    numbers of terms, however, which is what this library is optimized
    for, things got significantly faster.
    
    Below are the numbers from those runs. For each graphics card, the
    multiexp benchmark was run twice, and for each size the better (lower)
    of the two results was used. The table compares the runtime prior to
    this commit (old) with the runtime with this commit applied (new).
    Column headers give the number of terms. For each size, the better of
    old and new is in bold.
    
    |   GPU     | CUDA/OpenCL | Version |    1024      |     2048     |     4096     |     8192     |     16384    |     32768    |     65536    |     131072   |     262144   |    524288    |   1048576    |   2097152    |   4194304    |    8388608   |  16777216   |  33554432   |  67108864   |  134217728  |  268435456  |
    | --------- | ----------- | ------- | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ----------- | ----------- | ----------- | ----------- | ----------- |
    | RTX6000   | CUDA        | old     |   15.751ms   |   15.738ms   |   17.320ms   |   19.831ms   |   21.809ms   |   29.624ms   | **40.618ms** | **65.155ms** | **105.18ms** | **185.14ms** | **272.54ms** |   420.49ms   |   790.19ms   |   1.4999s    |   2.9603s   |   5.7531s   |   11.434s   |   22.874s   |   51.522s   |
    |           |             | new     | **14.532ms** | **15.147ms** | **16.314ms** | **19.004ms** | **20.955ms** | **28.732ms** |   41.564ms   |   66.707ms   |   109.32ms   |   186.84ms   |   330.00ms   | **410.14ms** | **774.19ms** | **1.3458s**  | **2.4790s** | **4.7752s** | **9.7055s** | **19.031s** | **41.963s** |
    |           | OpenCL      | old     |   15.600ms   |   15.877ms   |   17.590ms   |   19.921ms   |   22.696ms   |   32.388ms   |   42.934ms   |   71.716ms   |   112.10ms   |   191.42ms   |   341.43ms   |   580.50ms   |    1.0662s   |   2.0566s    |   4.1424s   |   8.4653s   |   16.047s   |   35.467s   |   72.832s   |
    |           |             | new     | **14.544ms** | **15.009ms** | **16.649ms** | **18.487ms** | **20.833ms** | **27.015ms** | **38.307ms** | **62.836ms** | **102.64ms** | **176.04ms** | **309.27ms** | **526.33ms** | **984.03ms** | **1.8042s**  | **3.4835s** | **6.8203s** | **14.003s** | **28.640s** | **58.824s** |
    | RTX2080Ti | CUDA        | old     |   10.994ms   |   11.244ms   |   14.179ms   |   15.410ms   |   17.584ms   |   24.994ms   | **31.996ms** |   51.912ms   |   93.754ms   | **151.65ms** | **221.34ms** |   364.90ms   |   677.16ms   |   1.3217s    |   2.5868s   |   5.2162s   |   10.402s   |   20.883s   |   41.937s   |
    |           |             | new     | **10.344ms** | **9.6598ms** | **14.143ms** | **15.274ms** | **17.198ms** | **21.817ms** |   35.415ms   | **50.726ms** | **88.575ms** |   153.75ms   |   271.72ms   | **319.03ms** | **590.31ms** | **1.0330s**  | **1.9006s** | **3.8953s** | **7.9497s** | **16.002s** | **32.062s** |
    |           | OpenCL      | old     |   11.447ms   | **11.552ms** | **14.123ms** |   16.393ms   |   21.599ms   |   27.510ms   |   37.208ms   |   60.860ms   |   105.99ms   |   170.86ms   |   302.54ms   |   523.51ms   |   962.37ms   |   1.9242s    |   3.8334s   |   7.7212s   |   15.376s   |   30.795s   |   61.678s   |
    |           |             | new     | **11.140ms** |   11.987ms   |   14.837ms   | **13.714ms** | **16.898ms** | **24.077ms** | **32.700ms** | **50.925ms** | **87.819ms** | **153.94ms** | **267.78ms** | **467.17ms** | **856.95ms** | **1.6093s**  | **3.1487s** | **6.3742s** | **12.894s** | **25.888s** | **52.105s** |
    | RTX3090   | CUDA        | old     |   28.924ms   |   28.606ms   |   29.551ms   | **20.608ms** |   33.097ms   |   36.271ms   |   36.353ms   |   43.155ms   |   67.801ms   |   86.059ms   |   150.68ms   |   340.78ms   |   534.71ms   |   985.17ms   |   1.7543s   |   3.5924s   |   7.2819s   |   14.658s   |   29.133s   |
    |           |             | new     | **15.513ms** | **16.934ms** | **19.606ms** |   23.755ms   | **24.186ms** | **28.759ms** | **32.147ms** | **35.125ms** | **50.428ms** | **76.278ms** | **122.85ms** | **206.41ms** | **529.83ms** | **953.46ms** | **1.7170s** | **3.2375s** | **6.6036s** | **13.378s** | **26.999s** |
    |           | OpenCL      | old     | **18.875ms** | **22.025ms** |   26.669ms   | **25.151ms** |   29.823ms   | **29.561ms** | **34.674ms** |   43.384ms   |   67.859ms   |   100.48ms   |   174.86ms   |   313.63ms   |   489.99ms   |   899.34ms   |   1.5981s   |   3.2942s   |   6.6854s   |   13.473s   |   26.754s   |
    |           |             | new     |   21.406ms   |   22.300ms   | **24.353ms** |   30.037ms   | **28.156ms** |   32.799ms   |   39.520ms   | **41.796ms** | **57.424ms** | **89.439ms** | **147.61ms** | **258.56ms** | **489.23ms** | **865.07ms** | **1.5351s** | **2.8767s** | **5.8899s** | **11.910s** | **24.049s** |
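
To read the table as relative changes rather than absolute times, one option is to divide the old runtime by the new one. The sketch below is only a reading aid: the helper names `best_of` and `speedup` are hypothetical, and the example values in the usage note are copied from the RTX6000 CUDA rows above.

```rust
/// Better (lower) of two benchmark runs, mirroring how each table
/// entry was selected from two runs.
pub fn best_of(run_a_secs: f64, run_b_secs: f64) -> f64 {
    run_a_secs.min(run_b_secs)
}

/// Ratio of old to new runtime. A value above 1.0 means the new
/// version is faster; a value below 1.0 means it regressed.
pub fn speedup(old_secs: f64, new_secs: f64) -> f64 {
    old_secs / new_secs
}
```

For example, `speedup(51.522, 41.963)` covers the 268435456-term case on the RTX6000 with CUDA and comes out above 1.0, while `speedup(0.27254, 0.33000)` for the 1048576-term case on the same card comes out below 1.0. The latter is one of the limited cases where the new version is slower.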
    vmx committed Jul 1, 2022
    7cc7424