Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explore partially decoding blocks (within-block skipping) #12749

Open
slow-J opened this issue Nov 2, 2023 · 2 comments
Open

Explore partially decoding blocks (within-block skipping) #12749

slow-J opened this issue Nov 2, 2023 · 2 comments

Comments

@slow-J
Copy link
Contributor

slow-J commented Nov 2, 2023

Description

Idea from @mikemccand 's comment in #12696 (comment)

Another exciting optimization such a "patch-less" encoding could implement is within-block skipping (I believe Tantivy does this).

Today, our skipper is forced to align to block boundaries, so when we skip to a given docid, we go to the block that may contain this docid, decode all 128 int[], then linearly scan within those 128 ints. This is quite a bit of overhead for each skip request!

If we could lower that linear scan cost to maybe 16 or 8 or something, the conjunctive queries should get even faster. But perhaps it becomes trickier to take advantage of SIMD optimizations if we are decoding a subset of ints, not sure.

After the change in #12741 , we will no longer use patching when encoding doc blocks.
This may allow us to partially decode blocks? This would mean skipping could jump to the middle of a block, instead of having to be at block boundaries as they are today.

@jpountz
Copy link
Contributor

jpountz commented Nov 3, 2023

How would it work? Since blocks are delta-coded, you can't know the value at a given index without decoding all previous values and computing their sum? Or you need to store some checkpoints separately, but then it may be easier/better to simply go with smaller blocks (e.g. 64 doc IDs instead of 128)?

@Tony-X
Copy link
Contributor

Tony-X commented Nov 7, 2023

How would it work? Since blocks are delta-coded, you can't know the value at a given index without decoding all previous values and computing their sum? Or you need to store some checkpoints separately, but then it may be easier/better to simply go with smaller blocks (e.g. 64 doc IDs instead of 128)?

+1. Delta-encoding here is the blocker, unless we change the encoding scheme.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants