Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove lastPosBlockOffset from term metadata for Lucene90PostingsFormat #12536

Closed
Tony-X opened this issue Sep 3, 2023 · 6 comments
Closed

Comments

@Tony-X
Copy link
Contributor

Tony-X commented Sep 3, 2023

Description

lastPosBlockOffset is used to identify the last non-packed-encoded positions block. The same information can be derived use totalTermFreq.

This could save space taken by the FSTs for positions enabled fields.

@Tony-X
Copy link
Contributor Author

Tony-X commented Sep 3, 2023

I quickly tried this out and realized it doesn't work when we need to skip with the skip lists we we lost track of how many positions that got skipped.

@mikemccand
Copy link
Member

Sorry, what went wrong when you tried to remove lastPosBlockOffset? Skipping is (currently) only at every 8 blocks boundaries right? So doesn't the skip data already know how many positions were skipped when it skips into a spot?

Hmm I think I see -- there are in general more positions blocks (since there can of course be many more positions than doc/freq) than doc/freq blocks, and so as PostingsEnum is skipping or reading positions, it must know when it transitions into the final (vInt encoded) chunk of positions? Hmm but again it seems like tracking number of positions skipped/read vs totalTermFreq should be enough?

Maybe make a PR adding a nice comment explaining why we really do need this int lastPosBlockOffset?

@Tony-X
Copy link
Contributor Author

Tony-X commented Sep 5, 2023

Today when it skips, the skipper can tell us 1) the offset of the position block we should seek to 2) how many positions it needs to skip within this block. This is because post-skipping the first doc's position is not always at the start of a positions block (that's actually the most common case). Additionally, advancing to a specific doc that is not the start of a block means we need to skip those docs as well as their positions.

In theory, if the skipper can tell us how many positions it has skipped that would work. This will require storing more information in the skip data than the current scheme.

Maybe make a PR adding a nice comment explaining why we really do need this int lastPosBlockOffset?
Sure, I can enhance the documentation of IntBlockTermState

@Tony-X
Copy link
Contributor Author

Tony-X commented Sep 6, 2023

#12541 adds more comment for lastPosBlockOffset

@mikemccand
Copy link
Member

In theory, if the skipper can tell us how many positions it has skipped that would work. This will require storing more information in the skip data than the current scheme.

OK, and it seems like that would not be a good tradeoff? Every skip entry would need to record how many total positions were skipped, whereas the lastPosBlockOffset is just a single long for the entire postings lists? Thanks for the PR improving the docs -- I'll review.

@Tony-X
Copy link
Contributor Author

Tony-X commented Sep 12, 2023

#12541 is merged and I'll close this one

@Tony-X Tony-X closed this as completed Sep 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants