Skip data should be inlined into the postings lists [LUCENE-2962] #4036

Open

asfimport opened this issue Mar 10, 2011 · 11 comments

Today, we store all skip data as a separate blob at the end of a given term's postings (if that term occurs in enough docs to warrant skip data).

But this adds overhead during decoding – we have to seek to a different place for the initial load, we have to init separate readers, we have to seek again while using the lower levels of the skip data, etc. Also, we have to fully decode all skip information even if we are not going to use it (e.g. if I only want docIDs, I still must decode the position offset and lastPayloadLength).

If instead we interleaved the skip data into the postings file, we could keep it local, and "private" to each file that needs skipping. This should make it less costly to init and then use the skip data, which would be a good perf gain for e.g. PhraseQuery and AndQuery.
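
For illustration, here is a minimal sketch of the decode penalty described above (hypothetical entry layout and names, not the actual Lucene skip format): in a single combined vInt-encoded skip stream, a doc-only consumer must still decode the position/payload fields just to stay in sync with the stream.

```java
import java.io.IOException;

import org.apache.lucene.store.DataInput;

// Hypothetical combined skip entry: doc, position and payload state live in
// one stream, so a docID-only query cannot avoid decoding the pos/payload
// fields -- it has to read past them to reach the next entry.
final class CombinedSkipEntry {
  int docID;              // doc this skip point lands on
  long docFP;             // file pointer into the .doc stream
  long posFP;             // file pointer into the .pos stream (unused by doc-only queries)
  int lastPayloadLength;  // payload state (unused by doc-only queries)

  static CombinedSkipEntry read(DataInput skipIn) throws IOException {
    CombinedSkipEntry e = new CombinedSkipEntry();
    e.docID = skipIn.readVInt();
    e.docFP = skipIn.readVLong();
    e.posFP = skipIn.readVLong();             // must be decoded to stay in sync,
    e.lastPayloadLength = skipIn.readVInt();  // even when the caller only wants docIDs
    return e;
  }
}
```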


Migrated from LUCENE-2962 by Michael McCandless (@mikemccand), updated Jan 29 2014
Attachments: proposal.txt

LiLi (migrated from JIRA)

I am interested in this issue. Could anyone tell me more about it, such as papers or other related material?

Michael McCandless (@mikemccand) (migrated from JIRA)

I think this paper is relevant: http://vigna.dsi.unimi.it/ftp/papers/CompressedPerfectEmbeddedSkipLists.pdf

Han Jiang (@sleepsort) (migrated from JIRA)

Hi, here is my understanding of this issue (after discussing it with Mike); I hope this is an accurate summary:

Extra penalties in the current implementation:

  1. Skip data for both docs (.doc) and positions (.pos) is gathered into the same blob (in .doc). For non-proximity queries this costs unnecessary decode time.
  2. In MultiLevelSkipListReader, each skip level gets its own input stream, yet they all jump around inside the same file, along with the docIn stream loading docIDs/freqs. If the skip data were simply interleaved, this jumping would be less frequent (the data stays "private", as Mike put it), which is friendlier to the I/O cache.

And to inline skip data into the postings list, there is more to dig into:

  1. Buffering: since we can hardly predict which file pointer to skip to, we might have to buffer the following postings data in memory in order to calculate the FP offset (see the sketch after this list).
  2. The file structure of MultiLevelSkipListWriter (even with the skip blob split) is still a little different from the paper in Mike's comment, which can be illustrated by comparing Figure 4 with Figure 7 in that paper.
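
The sketch referenced in point 1 above: a minimal illustration of why inlining forces buffering (hypothetical helper, not a proposed Lucene API). To write a skip entry in front of the block it jumps over, the block's encoded length must be known first, so the block is encoded into an in-memory buffer before anything is written to the postings file.

```java
import java.io.IOException;

import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.RAMOutputStream;

final class InlinedSkipSketch {
  // Writes one block of doc deltas preceded by a tiny "skip entry" recording
  // how many bytes to jump over.  The block is buffered in RAM first because
  // its encoded length (the forward FP offset) is unknown until every vInt
  // has been written.
  static void writeBlock(IndexOutput out, int[] docDeltas) throws IOException {
    RAMOutputStream buffer = new RAMOutputStream();
    for (int delta : docDeltas) {
      buffer.writeVInt(delta);                  // encode postings into memory
    }
    long blockLength = buffer.getFilePointer(); // now the jump distance is known
    out.writeVLong(blockLength);                // inlined skip entry: bytes to skip over
    buffer.writeTo(out);                        // then flush the actual block
  }
}
```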

Han Jiang (@sleepsort) (migrated from JIRA)

I'm very interested in this project, and I hope it will make a good GSoC project this year! The attachment is a summary of some thoughts about it; comments are welcome! :)

Han Jiang (@sleepsort) (migrated from JIRA)

Hmm, as for the average skip length, I think a histogram might be better; I'll add one later.

Han Jiang (@sleepsort) (migrated from JIRA)

A full summary of skip frequency in wikimedium.10M.nostopwords.tasks, plus part of crazyRandomMinShouldMatch.tasks. The latter is Really crazy :)
The 'skip_len' is counted as (newDocUpto - docUpto) in Lucene41PostingsReader.*Enum.advance(target): 0 when the skipper doesn't move, otherwise the number of docs skipped over. I changed the luceneutil code so that each query line is taken into account:

#query_category      #num_query  #num_called  #max_skip_len  #tot_skip_len  #avg_skip_len  #std_dev_skip_len
and_high_high:       500         18021935     14633          110997027      6.158996       25.283280
and_high_med:        500         9145928      22730          233710779      25.553534      61.885853
and_high_low:        500         1385125      215533         1073755035     775.204429     1606.345686
high_phrase:         42          253569       3284           5113544        20.166282      56.904256
high_sloppy_phrase:  42          2441007      3284           11993572       4.913371       23.253660
high_span_near:      42          2362258      3284           11846707       5.014993       23.604965
low_phrase:          500         6936508      21180          247018573      35.611373      103.751734
low_sloppy_phrase:   500         18170618     21180          298025713      16.401518      66.808480
low_span_near:       500         18100056     21180          296733920      16.394089      66.895263
med_phrase:          500         4513849      26367          144556764      32.025166      83.814376
med_sloppy_phrase:   500         17683175     26367          197756027      11.183287      45.764898
med_span_near:       500         17503372     26367          196409780      11.221254      45.958612
10terms_0high_2msm:  22          10875        32768          2502731        230.136184     1319.894640
10terms_0high_3msm:  17          17127        15743          440149         25.699130      209.870841
10terms_0high_4msm:  27          27144        24192          2156919        79.462091      640.948445 
10terms_0high_5msm:  21          19564        26479          1829846        93.531282      773.820054
10terms_0high_6msm:  27          17555        31232          1615071        92.000627      745.978516
10terms_0high_7msm:  27          16618        18688          1208893        72.745998      505.996915 
10terms_0high_8msm:  25          10722        17024          817872         76.279799      451.833907
10terms_0high_9msm:  16          5371         11008          411776         76.666543      353.098379
10terms_0high_10msm: 21          10403        32768          7325395        704.161780     2504.260576
10terms_5high_2msm:  24          650096       2163           1832245        2.818422       18.513591
10terms_5high_5msm:  24          1123877      276224         128339073      114.193166     936.887693
10terms_5high_10msm: 24          14211        1663232        322730000      22709.872634   115150.031194

This drove me to test whether a multi-level skip structure is really necessary for simpler queries like AndQuery and PhraseQuery.
So I set skipMultiplier=8000000 to make sure that Lucene41SkipWriter never creates more than one skip level; this run is marked 'comp',
and a clean trunk (skipMultiplier=8) is used as 'base' (why the huge multiplier forces a single level is sketched after the results table below):

                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
               LowPhrase       34.86      (2.7%)       32.23      (1.6%)   -7.5% ( -11% -   -3%)
                 LowTerm      335.88      (8.8%)      326.09      (7.8%)   -2.9% ( -17% -   14%)
            HighSpanNear        7.05      (2.2%)        6.97      (0.7%)   -1.1% (  -3% -    1%)
              AndHighMed       52.22      (1.3%)       51.72      (1.0%)   -1.0% (  -3% -    1%)
             MedSpanNear        4.30      (2.1%)        4.26      (0.7%)   -0.8% (  -3% -    2%)
             LowSpanNear       42.46      (1.7%)       42.28      (0.6%)   -0.4% (  -2% -    1%)
                  Fuzzy2       59.56      (5.4%)       59.31      (4.7%)   -0.4% (  -9% -   10%)
         LowSloppyPhrase       10.33      (2.6%)       10.30      (2.5%)   -0.3% (  -5% -    4%)
             AndHighHigh       18.37      (0.6%)       18.33      (0.3%)   -0.2% (  -1% -    0%)
                  Fuzzy1       53.70      (5.3%)       53.59      (5.2%)   -0.2% ( -10% -   10%)
              HighPhrase        2.56      (6.5%)        2.56      (5.6%)   -0.2% ( -11% -   12%)
                HighTerm       57.36     (15.2%)       57.34     (15.0%)   -0.0% ( -26% -   35%)
                 MedTerm       90.08     (13.9%)       90.30     (13.6%)    0.2% ( -23% -   32%)
                  IntNRQ        2.82     (13.3%)        2.83     (11.7%)    0.3% ( -21% -   29%)
               MedPhrase       15.18      (8.6%)       15.23      (8.4%)    0.3% ( -15% -   19%)
         MedSloppyPhrase        2.17      (4.2%)        2.18      (3.7%)    0.6% (  -6% -    8%)
               OrHighMed       20.30     (14.5%)       20.47     (14.4%)    0.8% ( -24% -   34%)
                Wildcard       21.53      (5.6%)       21.71      (4.9%)    0.8% (  -9% -   12%)
               OrHighLow       17.26     (15.0%)       17.43     (15.0%)    1.0% ( -25% -   36%)
        HighSloppyPhrase        8.31      (4.2%)        8.39      (4.5%)    1.0% (  -7% -   10%)
                 Prefix3       22.70      (5.8%)       22.93      (5.1%)    1.0% (  -9% -   12%)
              OrHighHigh       15.51     (14.7%)       15.69     (14.8%)    1.2% ( -24% -   35%)
                 Respell       41.39      (3.4%)       42.01      (3.3%)    1.5% (  -5% -    8%)
              AndHighLow      459.48      (2.3%)      468.22      (2.1%)    1.9% (  -2% -    6%)
                PKLookup      251.05      (4.4%)      259.80      (2.8%)    3.5% (  -3% -   11%)
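
To spell out why skipMultiplier=8000000 forces a single level (my reading of the MultiLevelSkipListWriter math; treat the exact formula as an assumption): the number of levels grows roughly as 1 + log_skipMultiplier(docFreq / skipInterval), so any multiplier larger than docFreq / skipInterval keeps it at one level.

```java
// Rough sketch of the level computation, not the exact Lucene code: a
// level-(k+1) skip entry covers skipInterval * skipMultiplier^k docs, so a
// level is added while the term's docFreq still exceeds that reach.
static int numberOfSkipLevels(long docFreq, long skipInterval, long skipMultiplier, int maxSkipLevels) {
  int levels = 1;
  long reach = skipInterval * skipMultiplier;  // docs covered by one level-2 entry
  while (levels < maxSkipLevels && docFreq > reach) {
    levels++;
    reach *= skipMultiplier;
  }
  return levels;
}
// With docFreq=10M and skipInterval=128: skipMultiplier=8 yields several levels,
// while skipMultiplier=8000000 yields exactly one, i.e. a flat skip list.
```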

Michael McCandless (@mikemccand) (migrated from JIRA)

Hi Billy,

The proposal looks good!

I think it needs some milestones with dates ... I would separate the
"dirt path": getting a basic vInt based impl working, probably first
index-time (writer) and then the reader, from experiments like
different ways of compressing the skip data, performance experiments
across different skip settings / corpora, etc.

And perhaps add some more detail about the design of the postings
format, ie skip blocks will be interleaved into each posting stream,
etc.

Separately, it's curious we have no tasks that are hurt that much from
only single-level skipping (though we should test the crazy
minShouldMatch tasks too). I think we need a corpus with more
documents? Maybe try wikimediumfull (33.3M) instead of just the 10M?

Han Jiang (@sleepsort) (migrated from JIRA)

Oh, sorry, I didn't make it clear:

All the tests above were already done on wikimediumfull, which uses WIKI_MEDIUM_TASKS_10MDOCS_FILE.

The crazyMinShouldMatch tasks benefit a lot from the skipper (as expected from the crazy avg_len :)),
and the results are below:

                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
       10Terms8High10MSM      322.25      (2.5%)       97.87      (0.9%)  -69.6% ( -71% -  -67%)
       10Terms4High10MSM      449.00      (2.1%)      194.73      (1.2%)  -56.6% ( -58% -  -54%)
       10Terms6High10MSM      611.10      (2.6%)      327.45      (1.4%)  -46.4% ( -49% -  -43%)
       10Terms2High10MSM      614.20      (2.6%)      472.07      (1.9%)  -23.1% ( -26% -  -19%)
        10Terms6High8MSM       61.24      (5.9%)       56.10      (5.6%)   -8.4% ( -18% -    3%)
        10Terms4High6MSM      104.63      (4.9%)      100.22      (5.0%)   -4.2% ( -13% -    5%)
        10Terms4High2MSM        6.31      (7.8%)        6.12      (8.7%)   -3.0% ( -18% -   14%)
        10Terms6High4MSM        1.75      (6.6%)        1.70      (7.3%)   -2.9% ( -15% -   11%)
        10Terms2High4MSM       31.74      (6.5%)       30.85      (7.4%)   -2.8% ( -15% -   11%)
        10Terms2High2MSM        5.30      (7.0%)        5.16      (8.0%)   -2.6% ( -16% -   13%)
        10Terms8High4MSM        0.87      (5.8%)        0.85      (6.3%)   -2.4% ( -13% -   10%)
        10Terms0High8MSM      216.98      (4.1%)      211.76      (4.9%)   -2.4% ( -10% -    6%)
        10Terms6High2MSM        0.92      (5.3%)        0.90      (6.0%)   -2.3% ( -12% -    9%)
        10Terms2High8MSM      115.45      (4.8%)      113.28      (5.1%)   -1.9% ( -11% -    8%)
        10Terms4High8MSM      209.93      (4.4%)      206.04      (4.8%)   -1.9% ( -10% -    7%)
        10Terms8High8MSM       11.03      (6.8%)       10.85      (8.1%)   -1.7% ( -15% -   14%)
        10Terms6High6MSM        9.30      (6.8%)        9.15      (8.0%)   -1.7% ( -15% -   14%)
        10Terms0High2MSM       27.76      (6.9%)       27.30      (8.4%)   -1.6% ( -15% -   14%)
        10Terms4High3MSM        4.34      (7.0%)        4.27      (8.2%)   -1.6% ( -15% -   14%)
        10Terms8High6MSM        3.06      (7.1%)        3.01      (8.3%)   -1.5% ( -15% -   14%)
        10Terms8High2MSM        2.33      (6.5%)        2.30      (7.5%)   -1.2% ( -14% -   13%)
        10Terms4High4MSM        8.77      (6.6%)        8.67      (8.1%)   -1.2% ( -14% -   14%)
        10Terms0High6MSM       77.21      (5.7%)       76.71      (5.9%)   -0.7% ( -11% -   11%)
        10Terms2High6MSM       73.82      (5.7%)       73.40      (6.1%)   -0.6% ( -11% -   11%)
        10Terms0High4MSM       63.80      (5.9%)       63.64      (6.3%)   -0.2% ( -11% -   12%)
       10Terms0High10MSM      595.12      (2.4%)      595.54      (2.4%)    0.1% (  -4% -    5%)
                PKLookup      244.34      (3.1%)      259.97      (3.0%)    6.4% (   0% -   12%)

Han Jiang (@sleepsort) (migrated from JIRA)

And... sorry Mike, and sorry to all of you, that I was so hasty to hand in the proposal.

I really would like to share my thoughts and discoveries with all of you.
But for this issue as a GSoC project, I'm quite in doubt about how much improvement we would finally gain.
When interleaving skip data between docID/freq blocks, some performance loss on non-skipping queries
still seems unavoidable. And the single-level-skipper experiment above shows that we should
be really cautious if we're going to sacrifice simplicity and introduce a more complex skip-list
structure.

I'll be really grateful if someone can see further and take this issue :). But if this issue is still
unassigned after the GSoC period, I'll be very glad to do more experiments on it. :)

Michael McCandless (@mikemccand) (migrated from JIRA)

> And... sorry Mike, and sorry to all of you, that I was so hasty to hand in the proposal.

No need to apologize! This is how it works :) Open source development is rarely a "straight line", and that's one thing that makes it fun!

> When interleaving skip data between docID/freq blocks, some performance loss on non-skipping queries still seems unavoidable.

Well, one thing we could do is still leave all skipData at the "end" of the postings (so an additional seek is required to go load it), but have it block-encoded, so that as you decode the postings you also go and read blocks from the skipData. This way non-skipping queries would pay no penalty ...
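
A rough sketch of that idea (hypothetical names, not a concrete design): the skip data stays in one blob after the postings, but is encoded in blocks, and a reader only decodes the next skip block when advance() actually needs one, so a next()-only consumer never touches it.

```java
import java.io.IOException;

import org.apache.lucene.store.IndexInput;

// Lazily decoded skip blob: nothing is read from skipIn until the first
// advance() that can actually use a skip point.
final class LazyBlockSkipReader {
  private final IndexInput skipIn;  // positioned at the start of the skip blob
  private final int[] skipDocs;     // current decoded block of skip points
  private int upto, count;          // read position / size of the current block
  private int lastSkipDoc;          // running docID (skip points are delta-coded)

  LazyBlockSkipReader(IndexInput skipIn, int blockSize) {
    this.skipIn = skipIn;
    this.skipDocs = new int[blockSize];
  }

  // Returns the first skip point >= target, decoding skip blocks on demand.
  int nextSkipPoint(int target) throws IOException {
    while (true) {
      if (upto == count) {
        count = readBlock();               // pull one more block, only when needed
        upto = 0;
        if (count == 0) {
          return Integer.MAX_VALUE;        // skip data exhausted
        }
      }
      int doc = skipDocs[upto++];
      if (doc >= target) {
        return doc;
      }
    }
  }

  private int readBlock() throws IOException {
    // Sketch only: real code would also read file pointers, handle levels, etc.
    int n = 0;
    while (n < skipDocs.length && skipIn.getFilePointer() < skipIn.length()) {
      lastSkipDoc += skipIn.readVInt();    // delta-encoded skip docIDs
      skipDocs[n++] = lastSkipDoc;
    }
    return n;
  }
}
```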

Han Jiang (@sleepsort) (migrated from JIRA)

Thank you, Mike
