Skip data should be inlined into the postings lists [LUCENE-2962] #4036
Comments
LiLi (migrated from JIRA) I am interested in this issue. Could anyone tell me more about it, such as relevant papers or related material?
Michael McCandless (@mikemccand) (migrated from JIRA) I think this paper is relevant: http://vigna.dsi.unimi.it/ftp/papers/CompressedPerfectEmbeddedSkipLists.pdf
Han Jiang (@sleepsort) (migrated from JIRA) Hi, here is my understanding of this issue (after discussion with Mike); I hope this is an accurate summary. Extra penalties of the current impl:
And, to inline skip data into the postings list, there will be more to dig into:
Han Jiang (@sleepsort) (migrated from JIRA) I'm very interested in this project, and I hope it will make a good GSoC project this year! The attachment is a summary of some thoughts about this project; comments are welcome! :)
Han Jiang (@sleepsort) (migrated from JIRA) Hmm, as for the average skip length, I think a histogram might be better; I'll add this later.
Han Jiang (@sleepsort) (migrated from JIRA) Here is a full summary of skip frequency in wikimedium.10M.nostopwords.tasks, and part of crazyRandomMinShouldMatch.tasks. The latter one is really crazy :)
This drives me to test whether a multi-level skip structure is really necessary for simpler queries like AndQuery & PhraseQuery (a baseline is sketched below).
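For context on what such a test would measure: a conjunction is essentially a leapfrog intersection, and every advance() call moves forward by some number of docs; the skip-length histogram is of those distances. Below is a minimal, purely illustrative sketch (none of this is Lucene code; all names are invented) of the no-skip baseline that a single- or multi-level skip list has to beat:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy leapfrog intersection over two sorted docID arrays. advance() is a
 *  plain linear scan, i.e. no skip structure at all -- the baseline any
 *  skip list must beat. All names here are illustrative, not Lucene's. */
public class Leapfrog {

  // Advance i until docs[i] >= target; a skip list would jump instead of scanning.
  static int advance(int[] docs, int i, int target) {
    while (i < docs.length && docs[i] < target) {
      i++;
    }
    return i;
  }

  static List<Integer> intersect(int[] a, int[] b) {
    List<Integer> hits = new ArrayList<>();
    int i = 0, j = 0;
    while (i < a.length && j < b.length) {
      if (a[i] == b[j]) {
        hits.add(a[i]);
        i++;
        j++;
      } else if (a[i] < b[j]) {
        i = advance(a, i, b[j]); // the distance moved here is one sample
      } else {                   // in the skip-length histogram
        j = advance(b, j, a[i]);
      }
    }
    return hits;
  }
}
```

If most advances in the histogram only move a handful of docs, this linear scan is already close to optimal, and deeper skip levels mostly add decode cost.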
Michael McCandless (@mikemccand) (migrated from JIRA) Hi Billy,
The proposal looks good! I think it needs some milestones with dates ... I would separate the ...
And perhaps add some more detail about the design of the postings ...
Separately, it's curious we have no tasks that are hurt that much from ...
Han Jiang (@sleepsort) (migrated from JIRA) Oh, sorry I didn't make it clear: all the tests above were already done on wikimediumfull, which uses WIKI_MEDIUM_TASKS_10MDOCS_FILE. The crazyMinShouldMatch tasks benefit a lot from the skipper (as is expected from the crazy avg_len :) ) ...
Han Jiang (@sleepsort) (migrated from JIRA) And... sorry Mike, and sorry to all of you, that I was so hasty to hand in the proposal. I really wanted to share my thoughts and discoveries with all of you. I'll be really grateful if someone can see further and take this issue :). But if this issue is still ...
Michael McCandless (@mikemccand) (migrated from JIRA)
No need to apologize! This is how it works :) Open source development is rarely a "straight line", and that's one thing that makes it fun!
Well, one thing we could do is still leave all skipData at the "end" of the postings (so an additional seek is required to go load it), but have it block-encoded so that, as you decode the postings, you also read blocks from the skipData. This way non-skipping queries would pay no penalty ...
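A minimal sketch of that variant, assuming the tail blob decodes to one "last docID" entry per postings block (the array below stands in for the lazily decoded blob; names and layout are assumptions, not Lucene's format):

```java
/** Sketch of the "skip blob at the end, consumed block-by-block" idea: the
 *  skip data stays in a blob after the postings, but is block-encoded so a
 *  reader decodes one skip entry per postings block it crosses. Purely
 *  illustrative; nothing here is Lucene's actual API or layout. */
class BlockSkipCursor {
  private final int[] blockLastDocs; // last docID of each postings block (stands in for the tail blob)
  private int block = -1;            // index of the last block we skipped past

  BlockSkipCursor(int[] blockLastDocs) {
    this.blockLastDocs = blockLastDocs;
  }

  /** Return the index of the first block that may contain target; every block
   *  before it can be jumped over without decoding its postings. A query that
   *  never calls this (a plain scan) never touches the skip blob at all. */
  int seekBlock(int target) {
    while (block + 1 < blockLastDocs.length && blockLastDocs[block + 1] < target) {
      block++;
    }
    return block + 1;
  }
}
```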
Han Jiang (@sleepsort) (migrated from JIRA) Thank you, Mike
Today, we store all skip data as a separate blob at the end of a given term's postings (if that term occurs in enough docs to warrant skip data).
But this adds overhead during decoding – we have to seek to a different place for the initial load, we have to init separate readers, we have to seek again while using the lower levels of the skip data, etc. Also, we have to fully decode all skip information even if we are not going to use it (eg if I only want docIDs, I still must decode position offset and lastPayloadLength).
If instead we interleaved skip data into the postings file, we could keep it local, and "private" to each file that needs skipping. This should make it less costly to init and then use the skip data, which'd be a good perf gain for eg PhraseQuery, AndQuery.
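To make the interleaved layout concrete, here is a toy sketch; the block size, fixed-width ints, and entry contents are all assumptions for illustration, not the format this issue actually produced:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Toy writer for interleaved skip data: each block of postings is preceded
 *  by an inline skip entry (last docID of the block + block length in bytes),
 *  so a reader can skip whole blocks with purely sequential reads and no
 *  separate skip stream. Not Lucene's postings format. */
public class InlineSkipWriter {
  static final int BLOCK_SIZE = 128; // hypothetical skip interval

  public static byte[] write(int[] docIDs) throws IOException {
    ByteArrayOutputStream result = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(result);
    for (int start = 0; start < docIDs.length; start += BLOCK_SIZE) {
      int end = Math.min(start + BLOCK_SIZE, docIDs.length);

      // Encode the block body first, so its byte length is known up front.
      ByteArrayOutputStream blockBytes = new ByteArrayOutputStream();
      DataOutputStream block = new DataOutputStream(blockBytes);
      int last = (start == 0) ? 0 : docIDs[start - 1];
      for (int i = start; i < end; i++) {
        block.writeInt(docIDs[i] - last); // delta-coded; a real codec would use vInts
        last = docIDs[i];
      }

      // Inline skip entry, then the block it describes. A reader looking for
      // a doc beyond this block's last docID reads 8 bytes and jumps
      // blockBytes.size() bytes ahead -- no seek to a separate skip blob,
      // no separate skip readers to init.
      out.writeInt(docIDs[end - 1]);
      out.writeInt(blockBytes.size());
      out.write(blockBytes.toByteArray());
    }
    out.flush();
    return result.toByteArray();
  }
}
```

Writing the skip entry just before the block it describes is what removes the extra seeks: a skipping reader reads the entry and jumps ahead sequentially, while a non-skipping reader decodes straight through, paying only a few extra bytes per block.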
Migrated from LUCENE-2962 by Michael McCandless (@mikemccand), updated Jan 29 2014
Attachments: proposal.txt