Can we decrease the overhead of skipping? #13106

jpountz · 2024-02-15T08:53:49Z

Description

On top-k queries, Lucene is now competitive with Tantivy/PISA on https://tantivy-search.github.io/bench/, but it's still quite slower on counting queries. This made me want to run a similar experiment as Tony-X/search-benchmark-game#44, though with a few more changes to how skipping works:

Single level of skip lists.
Skip data and impacts are inlined between blocks of postings.
Less overhead:
- no separate SkipReader abstraction that gets lazily instantiated: the skipping logic is more lightweight and within the postings/impacts enum logic,
- checking whether to skip and decode a new block is now a single check on BlockDocsEnum while it requires two different checks today.

A hacky implementation of this can be found at https://github.com/apache/lucene/compare/main...jpountz:lucene:skip_experiment?expand=1:

It doesn't replace existings skip data, just adds additional skip data and impacts inlined between blocks.
Only BlockDocsEnum and BlockImpactsDocsEnum switched to this new skip data, other impls still use existing skip data. So term queries will see a change, but not phrase queries.

It's quite naive, we could probably do something that is a bit more efficient. Yet results on wikibigall are interesting:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                    OrHighNotLow      327.87      (7.9%)      183.13      (5.4%)  -44.1% ( -53% -  -33%) 0.000
                    OrHighNotMed      374.44      (6.7%)      242.58      (5.0%)  -35.2% ( -43% -  -25%) 0.000
               HighTermMonthSort     4971.86      (1.6%)     3322.82      (1.1%)  -33.2% ( -35% -  -30%) 0.000
                        HighTerm      480.63      (6.5%)      343.22      (5.2%)  -28.6% ( -37% -  -18%) 0.000
                   OrHighNotHigh      226.71      (7.4%)      169.91      (6.2%)  -25.1% ( -36% -  -12%) 0.000
                         MedTerm      705.95      (5.8%)      617.27      (5.1%)  -12.6% ( -22% -   -1%) 0.000
                   OrNotHighHigh      183.15      (6.7%)      161.14      (6.3%)  -12.0% ( -23% -    1%) 0.000
                 CountOrHighHigh       56.83     (16.6%)       53.32     (10.5%)   -6.2% ( -28% -   25%) 0.160
                      OrHighHigh       73.45      (1.9%)       71.56      (1.3%)   -2.6% (  -5% -    0%) 0.000
                      TermDTSort      259.01      (4.1%)      253.64      (5.8%)   -2.1% ( -11% -    8%) 0.193
            HighTermTitleBDVSort       15.83      (6.0%)       15.60      (4.9%)   -1.4% ( -11% -   10%) 0.417
                     AndHighHigh       71.57      (2.2%)       70.60      (1.8%)   -1.4% (  -5% -    2%) 0.032
                       CountTerm    14081.51      (4.0%)    13918.96      (3.9%)   -1.2% (  -8% -    6%) 0.353
               HighTermTitleSort      112.66      (5.3%)      112.15      (2.4%)   -0.5% (  -7% -    7%) 0.724
                         Prefix3      340.35      (3.4%)      339.24      (2.9%)   -0.3% (  -6% -    6%) 0.745
                        Wildcard      106.77      (3.2%)      106.51      (3.5%)   -0.2% (  -6% -    6%) 0.824
                        PKLookup      282.16      (2.0%)      281.52      (1.3%)   -0.2% (  -3% -    3%) 0.667
                     LowSpanNear       13.84      (2.8%)       13.81      (2.6%)   -0.2% (  -5% -    5%) 0.839
                     CountPhrase        3.19      (8.0%)        3.19      (9.1%)   -0.1% ( -15% -   18%) 0.974
                      HighPhrase       29.89      (2.9%)       29.90      (4.7%)    0.0% (  -7% -    7%) 0.985
                     MedSpanNear        9.92      (2.3%)        9.93      (2.3%)    0.0% (  -4% -    4%) 0.952
                    HighSpanNear        5.37      (3.7%)        5.37      (3.4%)    0.1% (  -6% -    7%) 0.923
                         LowTerm     1154.36      (4.2%)     1155.62      (3.6%)    0.1% (  -7% -    8%) 0.930
                 MedSloppyPhrase       25.88      (2.7%)       25.93      (1.8%)    0.2% (  -4% -    4%) 0.789
                         Respell       59.03      (1.6%)       59.16      (1.6%)    0.2% (  -2% -    3%) 0.661
                       LowPhrase       48.93      (2.4%)       49.10      (4.7%)    0.4% (  -6% -    7%) 0.763
           HighTermDayOfYearSort      437.52      (1.2%)      440.03      (1.7%)    0.6% (  -2% -    3%) 0.227
                  CountOrHighMed      108.93     (11.3%)      109.65      (7.7%)    0.7% ( -16% -   22%) 0.830
                       MedPhrase       26.25      (2.6%)       26.44      (5.1%)    0.7% (  -6% -    8%) 0.580
                HighSloppyPhrase        6.35      (3.5%)        6.40      (2.1%)    0.8% (  -4% -    6%) 0.361
                          Fuzzy2       72.65      (1.1%)       73.27      (1.2%)    0.9% (  -1% -    3%) 0.022
                          Fuzzy1       90.62      (1.1%)       91.49      (1.2%)    1.0% (  -1% -    3%) 0.008
             LowIntervalsOrdered       15.56      (3.6%)       15.74      (3.4%)    1.2% (  -5% -    8%) 0.287
            HighIntervalsOrdered        3.05      (5.0%)        3.10      (5.0%)    1.6% (  -8% -   12%) 0.323
                 LowSloppyPhrase       18.12      (5.2%)       18.41      (2.3%)    1.6% (  -5% -    9%) 0.217
                       OrHighLow      658.61      (2.4%)      670.64      (1.9%)    1.8% (  -2% -    6%) 0.008
             MedIntervalsOrdered       18.21      (4.2%)       18.59      (3.7%)    2.1% (  -5% -   10%) 0.096
                    OrNotHighMed      337.44      (4.1%)      350.80      (4.5%)    4.0% (  -4% -   13%) 0.004
                       OrHighMed      147.53      (2.6%)      153.47      (2.1%)    4.0% (   0% -    8%) 0.000
                          IntNRQ      130.99     (21.2%)      139.01     (21.2%)    6.1% ( -29% -   61%) 0.360
                      AndHighMed      270.68      (2.4%)      291.75      (2.6%)    7.8% (   2% -   13%) 0.000
                      AndHighLow      903.88      (2.2%)     1000.22      (3.1%)   10.7% (   5% -   16%) 0.000
                    OrNotHighLow      793.34      (2.3%)      935.56      (2.1%)   17.9% (  13% -   22%) 0.000
                CountAndHighHigh       43.99      (2.1%)       51.94      (3.3%)   18.1% (  12% -   23%) 0.000
                 CountAndHighMed      129.42      (2.3%)      155.03      (3.4%)   19.8% (  13% -   26%) 0.000

CountAndHighHigh and CountAndHighMed became almost 20% faster! These are the main targets that I was targeting with this change, so it's good to see they are seeing a significant speedup. This confirms that we have some non-negligible overhead for skipping today though it's not easy to tell how much comes from the additional abstractions vs. multiple levels of skip lists.

OrNotHighLow and OrNotHighMed are faster. This is because the bottleneck of these queries is advancing the MUST_NOT clause, which are not scoring. So it's very similar to the speedup we're seeing on the counting queries.

AndHighLow and AndHighMed are 8%-11% faster. Again, I would attribute this to the faster skipping logic since this is about clauses that have different doc frequencies, so the higher frequency clause will need to do a lot of skipping to catch up with the leading clause. Interestingly, the fact that we are storing a single level of impact data doesn't hurt.

AndHighHigh and OrHighHigh are slightly slower (or is it noise?). I could believe that there is a small performance hit on this one due to having a single level of impact data. This forces Lucene to use the maximum score across the entire doc ID space as a score upper bound for the clause that has the higher cost. Maybe it could be enough to compute global impacts to have better performance on these queries by having slightly better score upper bounds for the following clause.

HighTerm, MedTerm, OrHighNotLow, OrHighNotMed, HighTerm, OrHighNotHigh, OrNotHighHigh are slower. This is expected as there are queries that have a single positive clause, which in-turn are queries where the score upper bounds that we compute are very close to the actual produced scores, which in-turn enables these queries to take advantage of the higher levels of impacts to skip more docs at once.

HighTermMonthSort is slower. This is because the sort dynamically introduces a filter that is so selective that the term query can take advantage of skip data on higher levels to skip more docs at once.

CountOrHighHigh is a bit slower because there's a bit more overhead to collect postings lists exhaustively now that skip data and impacts are inlined.

It's not a net win, but this suggests that we have some room for improvement here.

The text was updated successfully, but these errors were encountered:

mikemccand · 2024-02-15T14:57:39Z

Whoa, very cool @jpountz! This reminds me of this longstanding issue/paper which also inlined skip data directly in the postings, but maybe was still multi-level?

jpountz added the type:enhancement label Feb 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can we decrease the overhead of skipping? #13106

Can we decrease the overhead of skipping? #13106

jpountz commented Feb 15, 2024

mikemccand commented Feb 15, 2024

Can we decrease the overhead of skipping? #13106

Can we decrease the overhead of skipping? #13106

Comments

jpountz commented Feb 15, 2024

Description

mikemccand commented Feb 15, 2024