feat: soft delete optimize #12339

fudongyingluck · 2023-05-30T02:37:56Z

As mentioned in the ES issue, when soft deletes are enabled the numDeletesToMerge function is a very time-consuming part. As the following picture shows, the same value is actually calculated more than once there.

This change reuses the numDeletesToMerge result to reduce the time spent.

fudongyingluck · 2023-06-05T08:21:01Z

This is the esrally result. The command is like: esrally race --track=http_logs --target-hosts=*:9201 --pipeline=benchmark-only --offline --user-tag=softdelete:baseline --challenge=update

|                                                        Metric |   Task |        Baseline |       Contender |        Diff |   Unit |   Diff % |
|--------------------------------------------------------------:|-------:|----------------:|----------------:|------------:|-------:|---------:|
|                    Cumulative indexing time of primary shards |        |   515.49        |   504.15        |   -11.3398  |    min |   -2.20% |
|             Min cumulative indexing time across primary shard |        |     0           |     0           |     0       |    min |    0.00% |
|          Median cumulative indexing time across primary shard |        |    17.7529      |    17.9699      |     0.2169  |    min |   +1.22% |
|             Max cumulative indexing time across primary shard |        |   404.723       |   393.369       |   -11.3536  |    min |   -2.81% |
|           Cumulative indexing throttle time of primary shards |        |     0           |     0           |     0       |    min |    0.00% |
|    Min cumulative indexing throttle time across primary shard |        |     0           |     0           |     0       |    min |    0.00% |
| Median cumulative indexing throttle time across primary shard |        |     0           |     0           |     0       |    min |    0.00% |
|    Max cumulative indexing throttle time across primary shard |        |     0           |     0           |     0       |    min |    0.00% |
|                       Cumulative merge time of primary shards |        |   133.81        |   127.489       |    -6.32017 |    min |   -4.72% |
|                      Cumulative merge count of primary shards |        |   173           |   172           |    -1       |        |   -0.58% |
|                Min cumulative merge time across primary shard |        |     0           |     0           |     0       |    min |    0.00% |
|             Median cumulative merge time across primary shard |        |     2.61536     |     2.96084     |     0.34548 |    min |  +13.21% |
|                Max cumulative merge time across primary shard |        |   118.648       |   110.923       |    -7.7245  |    min |   -6.51% |
|              Cumulative merge throttle time of primary shards |        |    57.0305      |    55.1042      |    -1.92633 |    min |   -3.38% |
|       Min cumulative merge throttle time across primary shard |        |     0           |     0           |     0       |    min |    0.00% |
|    Median cumulative merge throttle time across primary shard |        |     0.215533    |     0.307242    |     0.09171 |    min |  +42.55% |
|       Max cumulative merge throttle time across primary shard |        |    55.2842      |    53.1749      |    -2.10932 |    min |   -3.82% |
|                     Cumulative refresh time of primary shards |        |    21.5803      |    20.5713      |    -1.009   |    min |   -4.68% |
|                    Cumulative refresh count of primary shards |        |   668           |   674           |     6       |        |   +0.90% |
|              Min cumulative refresh time across primary shard |        |     0           |     0           |     0       |    min |    0.00% |
|           Median cumulative refresh time across primary shard |        |     0.542333    |     0.508642    |    -0.03369 |    min |   -6.21% |
|              Max cumulative refresh time across primary shard |        |    18.1363      |    17.4352      |    -0.70113 |    min |   -3.87% |
|                       Cumulative flush time of primary shards |        |     9.37332     |    10.4646      |     1.09132 |    min |  +11.64% |
|                      Cumulative flush count of primary shards |        |    63           |    64           |     1       |        |   +1.59% |
|                Min cumulative flush time across primary shard |        |     0.00296667  |     0.0001      |    -0.00287 |    min |  -96.63% |
|             Median cumulative flush time across primary shard |        |     0.0971583   |     0.0769667   |    -0.02019 |    min |  -20.78% |
|                Max cumulative flush time across primary shard |        |     8.6855      |     9.83638     |     1.15088 |    min |  +13.25% |
|                                       Total Young Gen GC time |        |  1070.97        |  1065.08        |    -5.889   |      s |   -0.55% |
|                                      Total Young Gen GC count |        |  8254           |  8187           |   -67       |        |   -0.81% |
|                                         Total Old Gen GC time |        |     0.586       |     0           |    -0.586   |      s | -100.00% |
|                                        Total Old Gen GC count |        |     3           |     0           |    -3       |        | -100.00% |
|                                                    Store size |        |    17.0535      |    16.9082      |    -0.14531 |     GB |   -0.85% |
|                                                 Translog size |        |     4.09782e-07 |     4.09782e-07 |     0       |     GB |    0.00% |
|                                        Heap used for segments |        |     0           |     0           |     0       |     MB |    0.00% |
|                                      Heap used for doc values |        |     0           |     0           |     0       |     MB |    0.00% |
|                                           Heap used for terms |        |     0           |     0           |     0       |     MB |    0.00% |
|                                           Heap used for norms |        |     0           |     0           |     0       |     MB |    0.00% |
|                                          Heap used for points |        |     0           |     0           |     0       |     MB |    0.00% |
|                                   Heap used for stored fields |        |     0           |     0           |     0       |     MB |    0.00% |
|                                                 Segment count |        |   158           |   163           |     5       |        |   +3.16% |
|                                   Total Ingest Pipeline count |        |     0           |     0           |     0       |        |    0.00% |
|                                    Total Ingest Pipeline time |        |     0           |     0           |     0       |     ms |    0.00% |
|                                  Total Ingest Pipeline failed |        |     0           |     0           |     0       |        |    0.00% |
|                                                Min Throughput | update | 23056.7         | 23029.1         |   -27.5735  | docs/s |   -0.12% |
|                                               Mean Throughput | update | 29585.3         | 29794           |   208.699   | docs/s |   +0.71% |
|                                             Median Throughput | update | 28990.2         | 29011.7         |    21.4849  | docs/s |   +0.07% |
|                                                Max Throughput | update | 36131.5         | 36197.3         |    65.8749  | docs/s |   +0.18% |
|                                       50th percentile latency | update |  1421.89        |  1437.74        |    15.8507  |     ms |   +1.11% |
|                                       90th percentile latency | update |  2410.13        |  2420.23        |    10.1008  |     ms |   +0.42% |
|                                       99th percentile latency | update |  7076.3         |  7045.81        |   -30.4936  |     ms |   -0.43% |
|                                     99.9th percentile latency | update | 11033.5         | 10406.9         |  -626.525   |     ms |   -5.68% |
|                                    99.99th percentile latency | update | 14342.9         | 13304.1         | -1038.85    |     ms |   -7.24% |
|                                      100th percentile latency | update | 21652.9         | 21399.9         |  -253       |     ms |   -1.17% |
|                                  50th percentile service time | update |  1421.89        |  1437.74        |    15.8507  |     ms |   +1.11% |
|                                  90th percentile service time | update |  2410.13        |  2420.23        |    10.1008  |     ms |   +0.42% |
|                                  99th percentile service time | update |  7076.3         |  7045.81        |   -30.4936  |     ms |   -0.43% |
|                                99.9th percentile service time | update | 11033.5         | 10406.9         |  -626.525   |     ms |   -5.68% |
|                               99.99th percentile service time | update | 14342.9         | 13304.1         | -1038.85    |     ms |   -7.24% |
|                                 100th percentile service time | update | 21652.9         | 21399.9         |  -253       |     ms |   -1.17% |
|                                                    error rate | update |     0           |     0           |     0       |      % |    0.00% |

fudongyingluck · 2023-06-06T04:56:19Z

lucene benchmark result, python3.10 src/python/localrun.py -source wikimediumall

            BrowseDateSSDVFacets        1.54     (11.4%)        1.46     (16.1%)   -5.2% ( -29% -   25%) 0.242
          OrHighMedDayTaxoFacets        5.38      (5.6%)        5.24      (5.0%)   -2.6% ( -12% -    8%) 0.127
                        PKLookup      279.48      (3.0%)      273.06      (3.1%)   -2.3% (  -8% -    3%) 0.018
            MedTermDayTaxoFacets       35.78      (2.2%)       35.10      (1.8%)   -1.9% (  -5% -    2%) 0.002
            BrowseDateTaxoFacets        7.23     (22.3%)        7.10     (23.8%)   -1.8% ( -39% -   56%) 0.802
            HighIntervalsOrdered       10.59      (8.9%)       10.42      (8.6%)   -1.6% ( -17% -   17%) 0.568
       BrowseDayOfYearTaxoFacets        7.30     (21.8%)        7.19     (23.9%)   -1.6% ( -38% -   56%) 0.829
             LowIntervalsOrdered        4.55      (7.1%)        4.48      (7.1%)   -1.5% ( -14% -   13%) 0.495
             MedIntervalsOrdered        6.90      (8.1%)        6.81      (7.3%)   -1.4% ( -15% -   15%) 0.565
                          Fuzzy2      118.84      (2.2%)      117.28      (2.5%)   -1.3% (  -5% -    3%) 0.078
                         Respell       82.74      (3.1%)       81.79      (4.0%)   -1.2% (  -7% -    6%) 0.308
               HighTermMonthSort     3093.29      (5.8%)     3057.85      (6.7%)   -1.1% ( -12% -   12%) 0.562
     BrowseRandomLabelTaxoFacets        6.40     (38.8%)        6.33     (40.9%)   -1.1% ( -58% -  128%) 0.930
                        HighTerm      791.45      (5.1%)      783.46      (4.7%)   -1.0% ( -10% -    9%) 0.517
                      HighPhrase       30.44      (2.3%)       30.16      (2.2%)   -0.9% (  -5% -    3%) 0.190
                          Fuzzy1      108.68      (2.7%)      107.67      (3.6%)   -0.9% (  -7% -    5%) 0.359
                    OrHighNotMed      320.94      (6.6%)      318.02      (5.3%)   -0.9% ( -11% -   11%) 0.629
                   OrNotHighHigh      468.36      (5.3%)      464.33      (4.2%)   -0.9% (  -9% -    9%) 0.568
                 LowSloppyPhrase       34.97      (4.1%)       34.69      (4.2%)   -0.8% (  -8% -    7%) 0.534
                       MedPhrase      242.27      (2.5%)      240.32      (1.9%)   -0.8% (  -5% -    3%) 0.248
                      AndHighMed       77.34      (6.0%)       76.76      (5.7%)   -0.8% ( -11% -   11%) 0.686
                    OrHighNotLow      744.00      (6.5%)      738.66      (5.8%)   -0.7% ( -12% -   12%) 0.711
                      AndHighLow      586.58      (3.5%)      582.51      (4.2%)   -0.7% (  -8% -    7%) 0.573
                HighSloppyPhrase        3.91      (4.5%)        3.89      (3.9%)   -0.6% (  -8% -    8%) 0.670
                     MedSpanNear       37.46      (2.1%)       37.26      (2.5%)   -0.6% (  -5% -    4%) 0.441
                       LowPhrase      153.02      (2.2%)      152.17      (2.1%)   -0.6% (  -4% -    3%) 0.417
                    OrNotHighLow     1030.00      (3.2%)     1025.40      (3.5%)   -0.4% (  -6% -    6%) 0.675
                        Wildcard       35.75      (3.2%)       35.59      (4.5%)   -0.4% (  -7% -    7%) 0.723
                         MedTerm      761.12      (5.8%)      757.86      (6.0%)   -0.4% ( -11% -   12%) 0.819
                     AndHighHigh       22.42      (6.5%)       22.33      (5.7%)   -0.4% ( -11% -   12%) 0.830
                         LowTerm      689.41      (3.9%)      686.65      (4.6%)   -0.4% (  -8% -    8%) 0.768
                    HighSpanNear        2.47      (4.2%)        2.46      (5.0%)   -0.4% (  -9% -    9%) 0.789
        AndHighHighDayTaxoFacets        7.97      (1.6%)        7.94      (1.9%)   -0.4% (  -3% -    3%) 0.522
                   OrHighNotHigh      352.84      (6.6%)      351.68      (4.9%)   -0.3% ( -11% -   11%) 0.859
         AndHighMedDayTaxoFacets       48.80      (1.6%)       48.65      (2.3%)   -0.3% (  -4% -    3%) 0.611
                 MedSloppyPhrase       24.12      (2.4%)       24.04      (2.5%)   -0.3% (  -5% -    4%) 0.684
                       OrHighMed       37.82      (6.3%)       37.72      (5.5%)   -0.3% ( -11% -   12%) 0.891
            HighTermTitleBDVSort        7.13      (8.7%)        7.11      (8.1%)   -0.2% ( -15% -   18%) 0.927
                     LowSpanNear       26.13      (3.7%)       26.08      (3.3%)   -0.2% (  -6% -    7%) 0.866
                         Prefix3      408.84      (1.3%)      408.62      (2.1%)   -0.1% (  -3% -    3%) 0.923
                    OrNotHighMed      469.82      (4.2%)      470.09      (3.6%)    0.1% (  -7% -    8%) 0.963
               HighTermTitleSort      105.40      (2.8%)      105.54      (4.6%)    0.1% (  -7% -    7%) 0.914
                      OrHighHigh       13.48      (5.0%)       13.51      (4.7%)    0.2% (  -9% -   10%) 0.905
                      TermDTSort      241.07      (3.6%)      242.46      (4.9%)    0.6% (  -7% -    9%) 0.671
                       OrHighLow      235.33      (5.1%)      237.04      (5.2%)    0.7% (  -9% -   11%) 0.655
     BrowseRandomLabelSSDVFacets        4.96      (3.8%)        5.00     (11.6%)    0.9% ( -13% -   16%) 0.746
           HighTermDayOfYearSort      290.03      (3.5%)      292.75      (3.7%)    0.9% (  -6% -    8%) 0.408
                          IntNRQ       52.81     (18.2%)       54.52     (15.7%)    3.2% ( -25% -   45%) 0.546
       BrowseDayOfYearSSDVFacets        6.11      (4.2%)        6.32     (10.6%)    3.4% ( -10% -   19%) 0.186
           BrowseMonthTaxoFacets        9.69     (33.2%)       10.17     (34.1%)    5.0% ( -46% -  108%) 0.641
           BrowseMonthSSDVFacets        6.35      (2.9%)        6.68     (10.3%)    5.2% (  -7% -   18%) 0.030

And part of the CPU profile result:

PERCENT       CPU SAMPLES   STACK
4.71%         49024         org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$VaryingBPVReader#getLongValue()
3.88%         40306         org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegmentNHLD()
3.50%         36379         java.nio.Buffer#scope()
3.35%         34860         org.apache.lucene.queries.intervals.OrderedIntervalsSource$OrderedIntervalIterator#nextInterval()
3.18%         33041         org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts#countAll()

CPU merged search profile for baseline:
PERCENT       CPU SAMPLES   STACK
6.19%         63449         org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$VaryingBPVReader#getLongValue()
3.63%         37149         org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts#countAll()
3.58%         36660         org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegmentNHLD()
3.46%         35483         org.apache.lucene.queries.intervals.OrderedIntervalsSource$OrderedIntervalIterator#nextInterval()
3.19%         32707         org.apache.lucene.util.packed.DirectReader$DirectPackedReader20#get()

jpountz

Not computing the number of deletes twice makes sense to me. What I'm not super happy about is that it's a bit trappy for merge policies: they need to be very careful to call the right methods to avoid computing it twice. E.g. I believe that LogMergePolicy needs a similar fix to the one that you made to TieredMergePolicy.

As a potential alternative, I wonder if IndexWriter could use a wrapper around the MergeContext which would memoize the number of deletes of every SegmentCommitInfo in a hash map when calling the merge policy. This way, if you happen to call numDeletesToMerge twice on the same SegmentCommitInfo, the second one would be served from the cache?
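
A minimal sketch of the memoization idea described above. It does not use Lucene's real types: `DeleteCounter` and `MemoizingDeleteCounter` are hypothetical stand-ins for MergePolicy.MergeContext and the proposed wrapper, segments are represented by plain strings instead of SegmentCommitInfo, and checked exceptions are left out for brevity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the single MergeContext method discussed here.
// Segments are identified by name only; this is an assumption of the sketch.
interface DeleteCounter {
  int numDeletesToMerge(String segmentName);
}

// Wraps another DeleteCounter and remembers the result for each segment, so a
// second call for the same segment is answered from the map instead of being
// recomputed by the (potentially expensive) delegate.
final class MemoizingDeleteCounter implements DeleteCounter {
  private final DeleteCounter delegate;
  private final Map<String, Integer> cache = new HashMap<>();

  MemoizingDeleteCounter(DeleteCounter delegate) {
    this.delegate = delegate;
  }

  @Override
  public int numDeletesToMerge(String segmentName) {
    Integer cached = cache.get(segmentName);
    if (cached != null) {
      return cached; // served from the cache
    }
    int computed = delegate.numDeletesToMerge(segmentName);
    cache.put(segmentName, computed);
    return computed;
  }
}
```

In this sketch the caller would create one wrapper before invoking the merge policy and discard it afterwards, so cached counts cannot go stale once the segment's deletes change.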

This reverts commit d21ae98.

fudongyingluck · 2023-06-09T04:02:52Z

Thanks @jpountz for your time. I really think this is a good idea, much better than mine. I wonder whether the newest commit implements your idea. I'll commit some test cases and post benchmark results later.

fudongyingluck · 2023-06-09T06:18:10Z

lucene benchmark result, python3.10 src/python/localrun.py -source wikimediumall

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
       BrowseDayOfYearSSDVFacets        6.29      (7.5%)        6.16      (9.4%)   -2.1% ( -17% -   15%) 0.428
            HighTermTitleBDVSort       10.68      (8.0%)       10.51      (4.3%)   -1.6% ( -12% -   11%) 0.442
                          IntNRQ       47.94      (1.8%)       47.23      (8.1%)   -1.5% ( -11% -    8%) 0.422
          OrHighMedDayTaxoFacets        6.09      (6.9%)        6.02      (5.2%)   -1.1% ( -12% -   11%) 0.563
           BrowseMonthSSDVFacets        6.56      (8.0%)        6.50      (8.3%)   -0.9% ( -15% -   16%) 0.719
             MedIntervalsOrdered       27.91      (4.9%)       27.67      (5.1%)   -0.9% ( -10% -    9%) 0.579
                HighSloppyPhrase        3.61      (4.3%)        3.58      (4.4%)   -0.7% (  -9% -    8%) 0.596
       BrowseDayOfYearTaxoFacets        7.06      (6.1%)        7.04      (5.2%)   -0.3% ( -10% -   11%) 0.850
                       OrHighLow      351.52      (2.9%)      350.34      (3.4%)   -0.3% (  -6% -    6%) 0.737
                      OrHighHigh       20.17      (3.4%)       20.11      (3.7%)   -0.3% (  -7% -    7%) 0.776
                 MedSloppyPhrase        4.99      (4.0%)        4.97      (4.7%)   -0.3% (  -8% -    8%) 0.822
     BrowseRandomLabelTaxoFacets        6.20      (7.1%)        6.19      (5.3%)   -0.2% ( -11% -   13%) 0.914
                       OrHighMed      106.18      (3.5%)      105.99      (3.4%)   -0.2% (  -6% -    6%) 0.866
            BrowseDateTaxoFacets        7.00      (5.9%)        6.99      (4.8%)   -0.1% ( -10% -   11%) 0.947
                          Fuzzy1       88.03      (3.3%)       87.96      (1.9%)   -0.1% (  -5% -    5%) 0.925
            HighIntervalsOrdered       12.93      (4.5%)       12.92      (4.4%)   -0.0% (  -8% -    9%) 0.979
                         Prefix3      224.61      (2.8%)      224.69      (2.3%)    0.0% (  -4% -    5%) 0.966
                 LowSloppyPhrase       22.07      (4.2%)       22.10      (4.1%)    0.1% (  -7% -    8%) 0.928
                    OrNotHighMed      403.85      (2.4%)      404.49      (2.9%)    0.2% (  -4% -    5%) 0.851
                    OrHighNotLow      468.49      (5.1%)      469.62      (4.9%)    0.2% (  -9% -   10%) 0.879
               HighTermMonthSort     3512.17      (6.9%)     3523.19      (7.5%)    0.3% ( -13% -   15%) 0.890
                    OrHighNotMed      532.57      (4.3%)      534.39      (3.6%)    0.3% (  -7% -    8%) 0.786
                         MedTerm     1019.27      (4.5%)     1022.80      (4.3%)    0.3% (  -8% -    9%) 0.805
        AndHighHighDayTaxoFacets        7.35      (2.8%)        7.38      (1.7%)    0.4% (  -4% -    5%) 0.633
                     AndHighHigh       32.60      (3.8%)       32.72      (4.2%)    0.4% (  -7% -    8%) 0.776
                      AndHighLow      662.12      (3.5%)      664.62      (3.9%)    0.4% (  -6% -    7%) 0.745
                          Fuzzy2       91.31      (4.0%)       91.66      (2.3%)    0.4% (  -5% -    6%) 0.709
                   OrNotHighHigh      675.72      (3.3%)      679.20      (3.6%)    0.5% (  -6% -    7%) 0.636
                      AndHighMed       96.86      (5.9%)       97.41      (6.3%)    0.6% ( -11% -   13%) 0.771
                        PKLookup      281.91      (3.6%)      283.54      (2.6%)    0.6% (  -5% -    7%) 0.566
                        Wildcard      183.21      (4.9%)      184.35      (2.8%)    0.6% (  -6% -    8%) 0.619
                   OrHighNotHigh      500.76      (3.4%)      504.24      (3.4%)    0.7% (  -5% -    7%) 0.513
                       LowPhrase      183.76      (2.7%)      185.24      (2.5%)    0.8% (  -4% -    6%) 0.326
                         LowTerm      732.82      (3.0%)      738.99      (3.0%)    0.8% (  -4% -    6%) 0.368
            MedTermDayTaxoFacets       38.20      (2.9%)       38.53      (1.9%)    0.9% (  -3% -    5%) 0.273
                       MedPhrase       85.79      (2.5%)       86.54      (2.3%)    0.9% (  -3% -    5%) 0.250
                        HighTerm      678.62      (4.8%)      684.64      (4.4%)    0.9% (  -7% -   10%) 0.544
         AndHighMedDayTaxoFacets       34.42      (2.5%)       34.73      (1.5%)    0.9% (  -2% -    4%) 0.164
             LowIntervalsOrdered       16.93      (3.6%)       17.09      (3.2%)    1.0% (  -5% -    7%) 0.373
                     MedSpanNear       25.65      (3.4%)       25.89      (4.3%)    1.0% (  -6% -    9%) 0.440
                    HighSpanNear        9.16      (3.9%)        9.25      (4.7%)    1.0% (  -7% -    9%) 0.473
                      HighPhrase      136.81      (2.8%)      138.28      (2.8%)    1.1% (  -4% -    6%) 0.231
                         Respell       67.25      (4.6%)       68.00      (3.4%)    1.1% (  -6% -    9%) 0.377
     BrowseRandomLabelSSDVFacets        5.26      (7.4%)        5.32      (7.2%)    1.1% ( -12% -   16%) 0.627
                     LowSpanNear        7.90      (3.7%)        7.99      (3.9%)    1.1% (  -6% -    9%) 0.347
           HighTermDayOfYearSort      400.43      (2.6%)      405.41      (2.7%)    1.2% (  -3% -    6%) 0.137
                    OrNotHighLow      818.63      (3.1%)      828.86      (3.0%)    1.2% (  -4% -    7%) 0.199
               HighTermTitleSort       62.96      (2.5%)       63.77      (3.1%)    1.3% (  -4% -    7%) 0.149
           BrowseMonthTaxoFacets       10.34     (33.0%)       10.47     (33.7%)    1.3% ( -49% -  101%) 0.902
                      TermDTSort      239.19      (4.0%)      242.85      (6.9%)    1.5% (  -8% -   12%) 0.390
            BrowseDateSSDVFacets        1.49     (12.4%)        1.54     (11.4%)    3.4% ( -18% -   31%) 0.362

jpountz

This looks great. Can you add a CHANGES entry under 9.7?

jpountz · 2023-06-09T07:03:18Z

lucene/core/src/java/org/apache/lucene/index/CachingMergeContext.java

+ * a wrapper of IndexWriter MergeContext. Try to cache the {@link
+ * #numDeletesToMerge(SegmentCommitInfo)} result in merge phase, to avoid duplicate calculation
+ */
+public class CachingMergeContext implements MergePolicy.MergeContext {


Can you make it pkg-private instead of public?

Yes, I've done this ~

jpountz · 2023-06-09T07:25:50Z

I'll note that there is still room for improvement, as this change doesn't cache the number of soft deletes across calls to findMerges. But the fix is so simple and contained, this looks to me like a good case of progress over perfection.

…dc8ca633e8bcf`) (#20) * Add next minor version 9.7.0 * Fix SynonymQuery equals implementation (apache#12260) The term member of TermAndBoost used to be a Term instance and became a BytesRef with apache#11941, which means its equals impl won't take the field name into account. The SynonymQuery equals impl needs to be updated accordingly to take the field into account as well, otherwise synonym queries with same term and boost across different fields are equal which is a bug. * Fix MMapDirectory documentation for Java 20 (apache#12265) * Don't generate stacktrace in CollectionTerminatedException (apache#12270) CollectionTerminatedException is always caught and never exposed to users so there's no point in filling in a stack-trace for it. * add missing changelog entry for apache#12260 * Add missing author to changelog entry for apache#12220 * Make query timeout members final in ExitableDirectoryReader (apache#12274) There's a couple of places in the Exitable wrapper classes where queryTimeout is set within the constructor and never modified. This commit makes such members final. * Update javadocs for QueryTimeout (apache#12272) QueryTimeout was introduced together with ExitableDirectoryReader but is now also optionally set to the IndexSearcher to wrap the bulk scorer with a TimeLimitingBulkScorer. Its javadocs needs updating. * Make TimeExceededException members final (apache#12271) TimeExceededException has three members that are set within its constructor and never modified. They can be made final. * DOAP changes for release 9.6.0 * Add back-compat indices for 9.6.0 * `ToParentBlockJoinQuery` Explain Support Score Mode (apache#12245) (apache#12283) * `ToParentBlockJoinQuery` Explain Support Score Mode --------- Co-authored-by: Marcus <marcuseagan@gmail.com> * Simplify SliceExecutor and QueueSizeBasedExecutor (apache#12285) The only behaviour that QueueSizeBasedExecutor overrides from SliceExecutor is when to execute on the caller thread. 
There is no need to override the whole invokeAll method for that. Instead, this commit introduces a shouldExecuteOnCallerThread method that can be overridden. * [Backport] GITHUB-11838 Add api to allow concurrent query rewrite (apache#12197) * GITHUB-11838 Change API to allow concurrent query rewrite (apache#11840) Replace Query#rewrite(IndexReader) with Query#rewrite(IndexSearcher) Co-authored-by: Patrick Zhai <zhaih@users.noreply.github.com> Co-authored-by: Adrien Grand <jpountz@gmail.com> Backport of apache#11840 Changes from original: - Query keeps `rewrite(IndexReader)`, but it is now deprecated - VirtualMethod is used to correct delegate to the overridden methods - The changes to `RewriteMethod` type classes are reverted, this increased the backwards compatibility impact. ------------------------------ ### Description Issue: apache#11838 #### Updated Proposal * Change signature of rewrite to `rewrite(IndexSearcher)` * How did I migrate the usage: * Use Intellij to do preliminary refactoring for me * For test usage, use searcher whenever is available, otherwise create one using `newSearcher(reader)` * For very few non-test classes which doesn't have IndexSearcher available but called rewrite, create a searcher using `new IndexSearcher(reader)`, tried my best to avoid creating it recurrently (Especially in `FieldQuery`) * For queries who have implemented the rewrite and uses some part of reader's functionality, use shortcut method when possible, otherwise pull out the reader from indexSearcher. 
* Backport: Concurrent rewrite for KnnVectorQuery (apache#12160) (apache#12288) * Concurrent rewrite for KnnVectorQuery (apache#12160) - Reduce overhead of non-concurrent search by preserving original execution - Improve readability by factoring into separate functions --------- Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com> * adjusting for backport --------- Co-authored-by: Kaival Parikh <46070017+kaivalnp@users.noreply.github.com> Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com> * toposort use iterator to avoid stackoverflow (apache#12286) Co-authored-by: tangdonghai <tangdonghai@meituan.com> # Conflicts: # lucene/CHANGES.txt * Fix test to compile with Java 11 after backport of apache#12286 * Update Javadoc for topoSortStates method after apache#12286 (apache#12292) * Optimize HNSW diversity calculation (apache#12235) * Minor cleanup and improvements to DaciukMihovAutomatonBuilder (apache#12305) * GITHUB-12291: Skip blank lines from stopwords list. (apache#12299) * Wrap Query rewrite backwards layer with AccessController (apache#12308) * Make sure APIJAR reproduces with different timezone (unfortunately java encodes the date using local timezone) (apache#12315) * Add multi-thread searchability to OnHeapHnswGraph (apache#12257) * Fix backport error * [MINOR] Update javadoc in Query class (apache#12233) - add a few missing full stops - update wording in the description of Query#equals method * [Backport] Integrate the Incubating Panama Vector API apache#12311 (apache#12327) Leverage accelerated vector hardware instructions in Vector Search. Lucene already has a mechanism that enables the use of non-final JDK APIs, currently used for the Previewing Pamana Foreign API. This change expands this mechanism to include the Incubating Pamana Vector API. When the jdk.incubator.vector module is present at run time the Panamaized version of the low-level primitives used by Vector Search is enabled. 
If not present, the default scalar version of these low-level primitives is used (as it was previously). Currently, we're only targeting support for JDK 20. A subsequent PR should evaluate JDK 21. --------- Co-authored-by: Uwe Schindler <uschindler@apache.org> Co-authored-by: Robert Muir <rmuir@apache.org> * Parallelize knn query rewrite across slices rather than segments (apache#12325) The concurrent query rewrite for knn vectory query introduced with apache#12160 requests one thread per segment to the executor. To align this with the IndexSearcher parallel behaviour, we should rather parallelize across slices. Also, we can reuse the same slice executor instance that the index searcher already holds, in that way we are using a QueueSizeBasedExecutor when a thread pool executor is provided. * Optimize ConjunctionDISI.createConjunction (apache#12328) This method is showing up as a little hot when profiling some queries. Almost all the time spent in this method is just burnt on ceremony around stream indirections that don't inline. Moving this to iterators, simplifying the check for same doc id and also saving one iteration (for the min cost) makes this method far cheaper and easier to read. * Update changes to be correct with ARM (it is called NEON there) * GH#12321: Marked DaciukMihovAutomatonBuilder as deprecated (apache#12332) Preparing to reduce visibility of this class in a future release * add BitSet.clear() (apache#12268) # Conflicts: # lucene/CHANGES.txt * Clenaup and update changes and synchronize with 9.x * Update TestVectorUtilProviders.java (apache#12338) * Don't generate stacktrace for TimeExceededException (apache#12335) The exception is package private and never rethrown, we can avoid generating a stacktrace for it. 
* Introduced the Word2VecSynonymFilter (apache#12169) Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io> * Word2VecSynonymFilter constructor null check (apache#12169) * Use thread-safe search version of HnswGraphSearcher (apache#12246) Addressing comment received in the PR apache#12246 * Word2VecSynonymProvider to use standard Integer max value for hnsw searches (apache#12235) We observed this change was not ported previously from main in an old cherry-pick * Fix searchafter high latency when after value is out of range for segment (apache#12334) * Make memory fence in `ByteBufferGuard` explicit (apache#12290) * Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit (apache#12320) * Add updateDocuments API which accept a query (reopen) (apache#12346) * GITHUB#11350: Handle backward compatibility when merging segments with different FieldInfo This commits restores Lucene 9's ability to handle indices created with Lucene 8 where there are discrepancies in FieldInfos, such as different IndexOptions * [Tessellator] Improve the checks that validate the diagonal between two polygon nodes (apache#12353) # Conflicts: # lucene/CHANGES.txt * feat: soft delete optimize (apache#12339) * Better paging when random reads go backwards (apache#12357) When reading data from outside the buffer, BufferedIndexInput always resets its buffer to start at the new read position. If we are reading backwards (for example, using an OffHeapFSTStore for a terms dictionary) then this can have the effect of re-reading the same data over and over again. This commit changes BufferedIndexInput to use paging when reading backwards, so that if we ask for a byte that is before the current buffer, we read a block of data of bufferSize that ends at the previous buffer start. 
Fixes apache#12356

* Work around SecurityManager issues during initialization of vector api (JDK-8309727) (apache#12362)

* Restrict GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth (apache#12249)

* Implement MMapDirectory with Java 21 Project Panama Preview API (apache#12294)

Backport incl JDK21 apijar file with java.util.Objects regenerated

* remove relic in apijar folder caused by vector additions

* Speed up IndexedDISI Sparse #AdvanceExactWithinBlock for tiny step advance (apache#12324)

* Add checks in KNNVectorField / KNNVectorQuery to only allow non-null, non-empty and finite vectors (apache#12281)

---------

Co-authored-by: Uwe Schindler <uschindler@apache.org>

* Implement VectorUtilProvider with Java 21 Project Panama Vector API (apache#12363) (apache#12365)

This commit enables the Panama Vector API for Java 21. The version of VectorUtilPanamaProvider for Java 21 is identical to that of Java 20. As such, there is no specific 21 version - the Java 20 version will be loaded from the MRJAR.

* Add CHANGES.txt for apache#12334 Honor after value for skipping documents even if queue is not full for PagingFieldCollector (apache#12368)

Signed-off-by: gashutos <gashutos@amazon.com>

* Move TermAndBoost back to its original location. (apache#12366)

PR apache#12169 accidentally moved the `TermAndBoost` class to a different location, which would break custom sub-classes of `QueryBuilder`. This commit moves it back to its original location.
* GITHUB-12252: Add function queries for computing similarity scores between knn vectors (apache#12253)

Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>

* hunspell (minor): reduce allocations when processing compound rules (apache#12316)

(cherry picked from commit a454388)

* hunspell (minor): reduce allocations when reading the dictionary's morphological data (apache#12323)

there can be many entries with morph data, so we'd better avoid compiling and matching regexes and even stream allocation

(cherry picked from commit 4bf1b94)

* TestHunspell: reduce the flakiness probability (apache#12351)

* TestHunspell: reduce the flakiness probability

We need to check how the timeout interacts with custom exception-throwing checkCanceled. The default timeout seems not enough for some CI agents, so let's increase it.

Co-authored-by: Dawid Weiss <dawid.weiss@gmail.com>

(cherry picked from commit 5b63a18)

* This allows VectorUtilProvider tests to be executed although hardware may not fully support vectorization or if C2 is not enabled (apache#12376)

---------

Signed-off-by: gashutos <gashutos@amazon.com>
Co-authored-by: Alan Woodward <romseygeek@apache.org>
Co-authored-by: Luca Cavanna <javanna@apache.org>
Co-authored-by: Uwe Schindler <uschindler@apache.org>
Co-authored-by: Armin Braun <me@obrown.io>
Co-authored-by: Mikhail Khludnev <mkhludnev@users.noreply.github.com>
Co-authored-by: Marcus <marcuseagan@gmail.com>
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
Co-authored-by: Kaival Parikh <46070017+kaivalnp@users.noreply.github.com>
Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>
Co-authored-by: tang donghai <tangdhcs@gmail.com>
Co-authored-by: Patrick Zhai <zhaih@users.noreply.github.com>
Co-authored-by: Greg Miller <gsmiller@gmail.com>
Co-authored-by: Jerry Chin <metrxqin@gmail.com>
Co-authored-by: Patrick Zhai <zhai7631@gmail.com>
Co-authored-by: Andrey Bozhko <andybozhko@gmail.com>
Co-authored-by: Chris Hegarty <62058229+ChrisHegarty@users.noreply.github.com>
Co-authored-by: Robert Muir <rmuir@apache.org>
Co-authored-by: Jonathan Ellis <jbellis@datastax.com>
Co-authored-by: Daniele Antuzi <daniele.antuzi@gmail.com>
Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>
Co-authored-by: Chaitanya Gohel <104654647+gashutos@users.noreply.github.com>
Co-authored-by: Petr Portnov | PROgrm_JARvis <pportnov@ozon.ru>
Co-authored-by: Tomas Eduardo Fernandez Lobbe <tflobbe@apache.org>
Co-authored-by: Ignacio Vera <ivera@apache.org>
Co-authored-by: fudongying <30896830+fudongyingluck@users.noreply.github.com>
Co-authored-by: Chris Fournier <chris.fournier@shopify.com>
Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>
Co-authored-by: Adrien Grand <jpountz@gmail.com>
Co-authored-by: Elia Porciani <e.porciani@sease.io>
Co-authored-by: Peter Gromov <peter@jetbrains.com>
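
The backward-paging behaviour described in the apache#12357 entry of the commit message above can be illustrated with a small stand-alone sketch. This is not Lucene's `BufferedIndexInput`; `PagedReader`, its fields, and the `pageBackwards` switch are hypothetical names chosen for illustration, and the "file" is just an in-memory byte array. The only idea taken from the commit message is that a read before the current buffer loads a page of `bufferSize` bytes ending at the previous buffer start, instead of restarting the buffer at the requested position.

```java
/** Hypothetical buffered reader over an in-memory "file"; a sketch, not Lucene's BufferedIndexInput. */
final class PagedReader {
  private final byte[] file;
  private final byte[] buffer;
  private final boolean pageBackwards;
  private int bufferStart = 0;  // file offset of buffer[0]
  private int bufferLength = 0; // number of valid bytes in buffer
  int refills = 0;              // how many times the buffer was reloaded

  PagedReader(byte[] file, int bufferSize, boolean pageBackwards) {
    this.file = file;
    this.buffer = new byte[bufferSize];
    this.pageBackwards = pageBackwards;
  }

  byte readByte(int pos) {
    if (pos < 0 || pos >= file.length) {
      throw new IndexOutOfBoundsException("pos=" + pos);
    }
    boolean inBuffer = pos >= bufferStart && pos < bufferStart + bufferLength;
    if (!inBuffer) {
      int start = pos; // default: restart the buffer at the requested position
      if (pageBackwards && bufferLength > 0 && pos < bufferStart) {
        // Moving backwards: load the page that ends at the previous buffer start,
        // unless pos lies even further back, in which case start at pos.
        start = Math.min(Math.max(0, bufferStart - buffer.length), pos);
      }
      refill(start);
    }
    return buffer[pos - bufferStart];
  }

  private void refill(int start) {
    bufferStart = start;
    bufferLength = Math.min(buffer.length, file.length - start);
    System.arraycopy(file, start, buffer, 0, bufferLength);
    refills++;
  }
}
```

Reading a 64-byte array from the last byte down to the first with a 16-byte buffer, this sketch refills 5 times with `pageBackwards` enabled and 64 times without it, which is the "re-reading the same data over and over" effect the commit message describes.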

luyuncheng mentioned this pull request Jun 6, 2023

Reduce DeletesMerges time when softdelete enable #12350

Closed

jpountz reviewed Jun 8, 2023

fudongyingluck added 2 commits June 9, 2023 11:52

feat: soft delete optimize

d21ae98

cache numDeletesToMerge to reduce calculate

43454b9

fudongyingluck force-pushed the softDelete branch from 1cb6804 to 43454b9 Compare June 9, 2023 03:53

Revert "feat: soft delete optimize"

5aaa1d9

This reverts commit d21ae98.

fudongyingluck closed this Jun 9, 2023

fudongyingluck reopened this Jun 9, 2023

chore: add some test cases

44629fd

jpountz approved these changes Jun 9, 2023

fudongyingluck added 2 commits June 9, 2023 16:16

chore: pkg-private class as jpountz comment

366182d

chore: add change log

bcb8dac

fudongyingluck force-pushed the softDelete branch from 84a5b32 to bcb8dac Compare June 9, 2023 08:29

jpountz added this to the 9.7.0 milestone Jun 9, 2023

jpountz merged commit 2934899 into apache:main Jun 9, 2023
4 checks passed

jpountz pushed a commit that referenced this pull request Jun 9, 2023

feat: soft delete optimize (#12339)

1107aa2

dnhatn mentioned this pull request May 1, 2024

Duplication computation for TieredMergePolicy's numDeletesToMerge [LUCENE-10041] #11079

Closed
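
For readers skimming the timeline, the idea named in the commit "cache numDeletesToMerge to reduce calculate" can be sketched as a caching wrapper. The sketch below is an assumption-laden illustration, not the merged Lucene code: `DeleteCounter`, `CachingDeleteCounter`, and `CountingDeleteCounter` are hypothetical names, and a plain `String` stands in for a segment. It shows only the general technique of remembering each segment's delete count so that repeated queries during one merge-selection pass do not recompute it.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a per-segment delete counter; not a Lucene type. */
interface DeleteCounter<S> {
  int numDeletesToMerge(S segment);
}

/** Wraps a DeleteCounter and remembers each segment's result for the lifetime of this wrapper. */
final class CachingDeleteCounter<S> implements DeleteCounter<S> {
  private final DeleteCounter<S> delegate;
  private final Map<S, Integer> cache = new HashMap<>();

  CachingDeleteCounter(DeleteCounter<S> delegate) {
    this.delegate = delegate;
  }

  @Override
  public int numDeletesToMerge(S segment) {
    Integer cached = cache.get(segment);
    if (cached != null) {
      return cached; // already computed for this segment
    }
    int value = delegate.numDeletesToMerge(segment);
    cache.put(segment, value);
    return value;
  }
}

/** Illustration helper: counts how often the expensive path is taken. */
final class CountingDeleteCounter implements DeleteCounter<String> {
  int calls = 0;

  @Override
  public int numDeletesToMerge(String segment) {
    calls++;
    return segment.length(); // placeholder for the costly soft-delete count
  }
}
```

One design point worth noting under these assumptions: delete counts change as indexing continues, so a wrapper like this would need to be created fresh for each merge-selection pass rather than kept around, otherwise it would serve stale counts.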

fudongyingluck commented May 30, 2023
fudongyingluck commented Jun 5, 2023
fudongyingluck commented Jun 6, 2023
jpountz left a comment
fudongyingluck commented Jun 9, 2023
fudongyingluck commented Jun 9, 2023
jpountz left a comment
jpountz Jun 9, 2023
fudongyingluck Jun 9, 2023
jpountz commented Jun 9, 2023
