LUCENE-9636: Extract and operation to get a SIMD optimize #2139

gf2121 · 2020-12-10T10:23:55Z

Description

In decode6(), decode7(), decode12(), decode14(), decode15() and decode24(), every long is ANDed with the same mask and then shifted. By printing the generated assembly, I found that the JIT did not optimize these operations with SIMD instructions. But when we extract all the & operations and perform them first, the JIT does use SIMD to optimize them.

Tests

Java Version:

java version "11.0.6" 2020-01-14 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.6+8-LTS)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.6+8-LTS, mixed mode)

Method Benchmark

Using decode15() as an example, here is a microbenchmark based on JMH:
code:

    @Benchmark
    @BenchmarkMode({Mode.Throughput})
    @Fork(1)
    @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
    @Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
    public void decode15a() {
        for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) {
            long l0 = (TMP[tmpIdx+0] & MASK16_1) << 14;
            l0 |= (TMP[tmpIdx+1] & MASK16_1) << 13;
            l0 |= (TMP[tmpIdx+2] & MASK16_1) << 12;
            l0 |= (TMP[tmpIdx+3] & MASK16_1) << 11;
            l0 |= (TMP[tmpIdx+4] & MASK16_1) << 10;
            l0 |= (TMP[tmpIdx+5] & MASK16_1) << 9;
            l0 |= (TMP[tmpIdx+6] & MASK16_1) << 8;
            l0 |= (TMP[tmpIdx+7] & MASK16_1) << 7;
            l0 |= (TMP[tmpIdx+8] & MASK16_1) << 6;
            l0 |= (TMP[tmpIdx+9] & MASK16_1) << 5;
            l0 |= (TMP[tmpIdx+10] & MASK16_1) << 4;
            l0 |= (TMP[tmpIdx+11] & MASK16_1) << 3;
            l0 |= (TMP[tmpIdx+12] & MASK16_1) << 2;
            l0 |= (TMP[tmpIdx+13] & MASK16_1) << 1;
            l0 |= (TMP[tmpIdx+14] & MASK16_1) << 0;
            ARR[longsIdx+0] = l0;
        }
    }

    @Benchmark
    @BenchmarkMode({Mode.Throughput})
    @Fork(1)
    @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
    @Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
    public void decode15b() {
        shiftLongs(TMP, 30, TMP, 0, 0, MASK16_1);
        for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, longsIdx += 1) {
            long l0 = TMP[tmpIdx+0] << 14;
            l0 |= TMP[tmpIdx+1] << 13;
            l0 |= TMP[tmpIdx+2] << 12;
            l0 |= TMP[tmpIdx+3] << 11;
            l0 |= TMP[tmpIdx+4] << 10;
            l0 |= TMP[tmpIdx+5] << 9;
            l0 |= TMP[tmpIdx+6] << 8;
            l0 |= TMP[tmpIdx+7] << 7;
            l0 |= TMP[tmpIdx+8] << 6;
            l0 |= TMP[tmpIdx+9] << 5;
            l0 |= TMP[tmpIdx+10] << 4;
            l0 |= TMP[tmpIdx+11] << 3;
            l0 |= TMP[tmpIdx+12] << 2;
            l0 |= TMP[tmpIdx+13] << 1;
            l0 |= TMP[tmpIdx+14] << 0;
            ARR[longsIdx+0] = l0;
        }
    }
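The benchmark calls shiftLongs and uses MASK16_1 without showing either. The following is a minimal sketch, not the code from the patch: the body of shiftLongs is inferred from the call shiftLongs(TMP, 30, TMP, 0, 0, MASK16_1), the value of MASK16_1 is assumed to keep one low bit in each 16-bit lane, and the class and method names are hypothetical. The combining step is written as an inner loop for brevity, whereas the benchmark unrolls it. The sketch only checks that masking up front gives the same output as masking inline; it says nothing about speed.

```java
import java.util.Arrays;
import java.util.Random;

public class Decode15Sketch {
    // Assumption: keeps the lowest bit of each 16-bit lane of a long.
    static final long MASK16_1 = 0x0001_0001_0001_0001L;

    // Hypothetical helper, inferred from the call site in the benchmark.
    // A plain counted loop with independent iterations, which is the
    // shape of loop a JIT can auto-vectorize.
    static void shiftLongs(long[] a, int count, long[] b, int bi, int shift, long mask) {
        for (int i = 0; i < count; ++i) {
            b[bi + i] = (a[i] >>> shift) & mask;
        }
    }

    // Variant a: the mask is applied inside the combining expression.
    static void decodeInline(long[] tmp, long[] out) {
        for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, ++longsIdx) {
            long l0 = 0;
            for (int k = 0; k < 15; ++k) {
                l0 |= (tmp[tmpIdx + k] & MASK16_1) << (14 - k);
            }
            out[longsIdx] = l0;
        }
    }

    // Variant b: all 30 longs are masked first (in place), then combined.
    static void decodeHoisted(long[] tmp, long[] out) {
        shiftLongs(tmp, 30, tmp, 0, 0, MASK16_1);
        for (int iter = 0, tmpIdx = 0, longsIdx = 30; iter < 2; ++iter, tmpIdx += 15, ++longsIdx) {
            long l0 = 0;
            for (int k = 0; k < 15; ++k) {
                l0 |= tmp[tmpIdx + k] << (14 - k);
            }
            out[longsIdx] = l0;
        }
    }

    // Runs both variants on identical random input and compares the outputs.
    static boolean variantsAgree(long seed) {
        Random random = new Random(seed);
        long[] tmpA = new long[30];
        for (int i = 0; i < tmpA.length; ++i) {
            tmpA[i] = random.nextLong();
        }
        long[] tmpB = tmpA.clone();
        long[] outA = new long[32];
        long[] outB = new long[32];
        decodeInline(tmpA, outA);
        decodeHoisted(tmpB, outB);
        return Arrays.equals(outA, outB);
    }

    public static void main(String[] args) {
        System.out.println(variantsAgree(42L)); // prints "true"
    }
}
```

Note that variant b overwrites its input array with the masked values. Because applying the same mask twice changes nothing, calling it repeatedly on the same array, as a benchmark loop would, still yields the same output.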

Result:

Benchmark               Mode  Cnt          Score         Error  Units
MyBenchmark.decode15a  thrpt   10   65234108.600 ± 1336311.970  ops/s
MyBenchmark.decode15b  thrpt   10  106840656.363 ±  448026.092  ops/s

End-to-end Benchmark

An end-to-end benchmark based on wikimedium1m also looks positive overall:

                  Fuzzy1      131.77      (5.4%)      131.75      (4.2%)   -0.0% (  -9% -   10%) 0.990
               MedPhrase      146.41      (4.5%)      146.44      (4.8%)    0.0% (  -8% -    9%) 0.992
              AndHighMed      643.10      (5.4%)      643.95      (5.5%)    0.1% ( -10% -   11%) 0.939
            HighSpanNear      125.99      (5.7%)      126.48      (4.9%)    0.4% (  -9% -   11%) 0.818
                 Respell      164.81      (4.9%)      165.48      (4.5%)    0.4% (  -8% -   10%) 0.783
        HighSloppyPhrase      103.20      (6.2%)      103.65      (5.8%)    0.4% ( -10% -   13%) 0.816
                  IntNRQ      662.80      (5.0%)      665.87      (5.1%)    0.5% (  -9% -   11%) 0.770
                 Prefix3      882.57      (6.8%)      887.18      (8.6%)    0.5% ( -13% -   17%) 0.832
         LowSloppyPhrase       76.17      (5.5%)       76.57      (5.0%)    0.5% (  -9% -   11%) 0.754
             AndHighHigh      236.71      (5.8%)      237.99      (5.2%)    0.5% (  -9% -   12%) 0.756
                  Fuzzy2      100.40      (5.6%)      101.02      (4.7%)    0.6% (  -9% -   11%) 0.708
              OrHighHigh      154.05      (5.4%)      155.08      (5.0%)    0.7% (  -9% -   11%) 0.684
               LowPhrase      327.86      (4.4%)      330.10      (4.9%)    0.7% (  -8% -   10%) 0.641
BrowseDayOfYearSSDVFacets      120.00      (5.1%)      120.88      (4.5%)    0.7% (  -8% -   10%) 0.627
                 MedTerm     2239.68      (6.3%)     2256.94      (5.9%)    0.8% ( -10% -   13%) 0.690
                 LowTerm     2516.56      (6.1%)     2537.04      (6.3%)    0.8% ( -10% -   14%) 0.679
               OrHighMed      594.85      (6.7%)      599.76      (5.2%)    0.8% ( -10% -   13%) 0.664
         MedSloppyPhrase      256.82      (5.2%)      259.03      (5.1%)    0.9% (  -9% -   11%) 0.601
                PKLookup      221.95      (6.2%)      223.88      (5.6%)    0.9% ( -10% -   13%) 0.641
   BrowseMonthSSDVFacets      135.72      (5.9%)      136.94      (5.4%)    0.9% (  -9% -   12%) 0.615
             LowSpanNear      668.06      (6.4%)      674.95      (5.1%)    1.0% (  -9% -   13%) 0.572
              AndHighLow     1603.74      (7.1%)     1621.34      (5.5%)    1.1% ( -10% -   14%) 0.585
                HighTerm     1927.72      (5.4%)     1949.95      (6.6%)    1.2% ( -10% -   13%) 0.547
    HighIntervalsOrdered      293.62      (5.8%)      297.01      (5.0%)    1.2% (  -9% -   12%) 0.501
              HighPhrase      396.34      (5.4%)      401.03      (5.4%)    1.2% (  -9% -   12%) 0.491
                Wildcard      749.60      (7.8%)      759.43      (8.9%)    1.3% ( -14% -   19%) 0.620
             MedSpanNear      576.19      (5.8%)      584.48      (5.2%)    1.4% (  -9% -   13%) 0.407
BrowseDayOfYearTaxoFacets       32.34      (7.6%)       32.86      (8.0%)    1.6% ( -12% -   18%) 0.513
    BrowseDateTaxoFacets       32.23      (7.7%)       32.76      (8.0%)    1.6% ( -13% -   18%) 0.512
               OrHighLow      526.26      (6.7%)      536.54      (6.3%)    2.0% ( -10% -   16%) 0.342
   BrowseMonthTaxoFacets       35.48      (9.1%)       36.21      (9.1%)    2.1% ( -14% -   22%) 0.474
       HighTermMonthSort      349.19     (12.8%)      364.73     (14.0%)    4.5% ( -19% -   35%) 0.294
   HighTermDayOfYearSort      690.75     (11.2%)      724.87     (11.0%)    4.9% ( -15% -   30%) 0.159

dweiss · 2020-12-10T10:34:21Z

This is excellent, thank you!

jpountz

This is great. I'm curious if you tested other numbers of bits per value than 15?

Co-authored-by: 郭峰 <guofeng.my@bytedance.com>

gf2121 · 2020-12-14T18:49:41Z

> This is great. I'm curious if you tested other numbers of bits per value than 15?

The decode15 benchmark result above is only meant to show that extracting the mask this way can trigger SIMD optimization. However, as discussed in #2113, microbenchmarks are not always trustworthy. Here are the results for all of the affected bit widths:

Benchmark               Mode  Cnt          Score         Error  Units
MyBenchmark.decode6a   thrpt   10  247414556.380 ± 2434585.070  ops/s
MyBenchmark.decode6b   thrpt   10  222221023.558 ± 1030956.992  ops/s

MyBenchmark.decode7a   thrpt   10  197406024.801 ± 4874584.420  ops/s
MyBenchmark.decode7b   thrpt   10  147646576.688 ± 3572102.825  ops/s

MyBenchmark.decode12a  thrpt   10  131609297.779 ±  712263.151  ops/s
MyBenchmark.decode12b  thrpt   10  110071926.176 ± 1302030.745  ops/s

MyBenchmark.decode14a  thrpt   10   64464919.397 ± 1884249.466  ops/s
MyBenchmark.decode14b  thrpt   10  116994814.109 ±  467860.907  ops/s

MyBenchmark.decode15a  thrpt   10   65234108.600 ± 1336311.970  ops/s
MyBenchmark.decode15b  thrpt   10  106840656.363 ±  448026.092  ops/s

MyBenchmark.decode24a  thrpt   10   55316236.195 ± 2305321.938  ops/s
MyBenchmark.decode24b  thrpt   10   70260091.330 ± 1545397.554  ops/s
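To make the table above easier to compare, the throughput ratio b/a can be computed per method. This is a small sketch: the scores are copied from the table, while the class and method names are hypothetical. A ratio above 1 means the variant with the extracted mask was faster in the microbenchmark.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SpeedupRatios {
    // Scores in ops/s copied from the JMH table: {a variant, b variant}.
    static final Map<String, double[]> SCORES = new LinkedHashMap<>();
    static {
        SCORES.put("decode6", new double[] {247414556.380, 222221023.558});
        SCORES.put("decode7", new double[] {197406024.801, 147646576.688});
        SCORES.put("decode12", new double[] {131609297.779, 110071926.176});
        SCORES.put("decode14", new double[] {64464919.397, 116994814.109});
        SCORES.put("decode15", new double[] {65234108.600, 106840656.363});
        SCORES.put("decode24", new double[] {55316236.195, 70260091.330});
    }

    // Throughput of the b variant relative to the a variant.
    static double ratio(String method) {
        double[] scores = SCORES.get(method);
        return scores[1] / scores[0];
    }

    public static void main(String[] args) {
        for (String method : SCORES.keySet()) {
            System.out.printf("%-9s b/a = %.2f%n", method, ratio(method));
        }
    }
}
```

The ratios fall below 1 for decode6, decode7 and decode12 and above 1 for decode14, decode15 and decode24. These are single-fork microbenchmark numbers, so they are best read as a rough indication only.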

According to these results, the methods get a bit slower when bits per value <= 12. But when I removed the optimization for those methods, the end-to-end benchmark results became slower...

              HighPhrase      174.64      (3.9%)      173.31      (4.3%)   -0.8% (  -8% -    7%) 0.556
                 Respell      196.80      (3.5%)      195.51      (3.6%)   -0.7% (  -7% -    6%) 0.562
         LowSloppyPhrase      424.47      (4.2%)      422.49      (3.7%)   -0.5% (  -8% -    7%) 0.711
                  Fuzzy2       68.12     (16.9%)       67.84     (17.7%)   -0.4% ( -29% -   41%) 0.939
                  Fuzzy1      129.37      (7.1%)      129.02      (6.1%)   -0.3% ( -12% -   13%) 0.900
   BrowseMonthSSDVFacets      138.95      (2.0%)      138.85      (2.0%)   -0.1% (  -4% -    4%) 0.905
         MedSloppyPhrase      360.37      (4.2%)      360.20      (4.4%)   -0.0% (  -8% -    8%) 0.973
    HighIntervalsOrdered      122.33      (2.1%)      122.32      (1.9%)   -0.0% (  -3% -    4%) 0.992
              OrHighHigh       66.73      (4.6%)       66.76      (4.0%)    0.0% (  -8% -    9%) 0.978
            HighSpanNear      105.13      (2.7%)      105.28      (2.4%)    0.1% (  -4% -    5%) 0.865
BrowseDayOfYearSSDVFacets      122.74      (2.6%)      122.99      (1.3%)    0.2% (  -3% -    4%) 0.747
                 Prefix3      334.95      (6.3%)      335.70      (4.2%)    0.2% (  -9% -   11%) 0.894
             LowSpanNear      561.05      (3.8%)      562.69      (4.7%)    0.3% (  -7% -    9%) 0.830
   BrowseMonthTaxoFacets       36.31      (6.9%)       36.43      (6.2%)    0.3% ( -12% -   14%) 0.876
                Wildcard      147.10      (3.3%)      147.68      (3.0%)    0.4% (  -5% -    6%) 0.693
    BrowseDateTaxoFacets       32.84      (5.7%)       32.99      (5.7%)    0.4% ( -10% -   12%) 0.810
        HighSloppyPhrase       44.53      (6.5%)       44.72      (5.7%)    0.4% ( -11% -   13%) 0.821
               LowPhrase      297.21      (3.6%)      298.53      (2.9%)    0.4% (  -5% -    7%) 0.668
               OrHighLow      796.60      (7.5%)      800.20      (6.9%)    0.5% ( -12% -   15%) 0.842
                 LowTerm     2376.34      (6.6%)     2387.16      (4.5%)    0.5% ( -10% -   12%) 0.800
                PKLookup      227.92      (3.1%)      229.30      (1.8%)    0.6% (  -4% -    5%) 0.447
BrowseDayOfYearTaxoFacets       32.93      (6.1%)       33.13      (5.6%)    0.6% ( -10% -   13%) 0.737
                  IntNRQ      584.74      (7.6%)      589.54      (7.9%)    0.8% ( -13% -   17%) 0.738
              AndHighMed      805.31      (5.7%)      812.48      (4.3%)    0.9% (  -8% -   11%) 0.577
             AndHighHigh      382.16      (4.4%)      386.58      (3.5%)    1.2% (  -6% -    9%) 0.359
               OrHighMed      404.42      (5.1%)      409.44      (4.6%)    1.2% (  -8% -   11%) 0.418
                 MedTerm     2139.91      (6.0%)     2167.39      (5.0%)    1.3% (  -9% -   13%) 0.461
             MedSpanNear      273.38      (3.0%)      276.90      (3.1%)    1.3% (  -4% -    7%) 0.186
   HighTermDayOfYearSort      310.36     (16.4%)      314.78     (15.6%)    1.4% ( -26% -   40%) 0.779
       HighTermMonthSort      854.54     (13.1%)      869.18     (11.5%)    1.7% ( -20% -   30%) 0.660
              AndHighLow     1592.57      (7.8%)     1620.67      (5.6%)    1.8% ( -10% -   16%) 0.411
                HighTerm     1464.88      (5.7%)     1491.23      (4.4%)    1.8% (  -7% -   12%) 0.263
               MedPhrase      627.29      (4.3%)      639.12      (3.9%)    1.9% (  -6% -   10%) 0.149

So I chose to give more weight to the end-to-end results and kept the optimization for all of these methods :)

Co-authored-by: 郭峰 <guofeng.my@bytedance.com>

Exact and operation to get a SIMD optimize

94fd1b2

gf2121 force-pushed the LUCENE-9636 branch from e4a40be to 94fd1b2 Compare December 10, 2020 14:06

jpountz approved these changes Dec 14, 2020


jpountz merged commit ecd47a8 into apache:master Dec 14, 2020

jpountz pushed a commit that referenced this pull request Dec 14, 2020

LUCENE-9636: Exact and operation to get a SIMD optimize (#2139)

e753535

Co-authored-by: 郭峰 <guofeng.my@bytedance.com>

gf2121 mentioned this pull request Dec 14, 2020

LUCENE-9629: Use computed masks #2113

Merged

ctargett pushed a commit to ctargett/lucene-solr that referenced this pull request Dec 16, 2020

LUCENE-9636: Exact and operation to get a SIMD optimize (apache#2139)

9dfc6fc

Co-authored-by: 郭峰 <guofeng.my@bytedance.com>

epugh pushed a commit to epugh/lucene-solr-1 that referenced this pull request Jan 15, 2021

LUCENE-9636: Exact and operation to get a SIMD optimize (apache#2139)

7529feb

Co-authored-by: 郭峰 <guofeng.my@bytedance.com>
