LUCENE-10396: Add capability to jump to the next document with different ord in SortedDocValues #979

iverase
Contributor

@iverase iverase commented Jun 24, 2022

This PR proposes adding a new method to SortedDocValues that helps users advance an iterator to the next document that contains a different term than the current document, which can be especially useful when the index is sorted by this field.
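To make the proposed semantics concrete, here is a minimal sketch of what a linear-scan default could look like. It deliberately does not use Lucene: `ToyOrdIterator`, its fields, and the method name `advanceOrd` are hypothetical stand-ins, and the actual signature in this PR may differ.

```java
/** Toy stand-in for a sorted doc-values iterator; this is NOT Lucene's SortedDocValues. */
final class ToyOrdIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs; // ascending doc IDs that have a value
    private final int[] ords; // ords[i] is the term ordinal of docs[i]
    private int idx = -1;     // -1 means "not positioned yet"

    ToyOrdIterator(int[] docs, int[] ords) {
        if (docs.length != ords.length) {
            throw new IllegalArgumentException("docs and ords must have the same length");
        }
        this.docs = docs;
        this.ords = ords;
    }

    /** Moves to the next doc that has a value, or returns NO_MORE_DOCS. */
    int nextDoc() {
        if (idx < docs.length) {
            idx++;
        }
        return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
    }

    /** Ordinal of the current doc; only valid while positioned on a doc. */
    int ordValue() {
        return ords[idx];
    }

    /** Hypothetical default: scan forward until the ordinal changes. Requires a current doc. */
    int advanceOrd() {
        int currentOrd = ordValue();
        int doc = nextDoc();
        while (doc != NO_MORE_DOCS && ordValue() == currentOrd) {
            doc = nextDoc();
        }
        return doc;
    }

    public static void main(String[] args) {
        ToyOrdIterator it = new ToyOrdIterator(
            new int[] {0, 1, 2, 5, 7, 8},
            new int[] {0, 0, 0, 1, 1, 3});
        it.nextDoc();                        // position on doc 0 (ord 0)
        System.out.println(it.advanceOrd()); // 5: first doc whose ord is not 0
        System.out.println(it.advanceOrd()); // 8: first doc whose ord is not 1
    }
}
```

The cost of this scan is proportional to the number of documents sharing the current term, which is why long runs of equal terms in a sorted index are the interesting case.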

The method has a default implementation, but this PR also provides a fast implementation for when the index is sorted by this field and the field has low cardinality. In that case we write a jump table to disk that lets us skip documents quickly instead of iterating through them one by one.
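The jump-table idea can be illustrated with a small in-memory sketch. It assumes a dense field (every document has a value) in an index sorted by that field, so ordinals never decrease as doc IDs grow. `OrdJumpTable` and its methods are hypothetical names, and the on-disk layout used by the PR is not shown here.

```java
import java.util.Arrays;

/** In-memory sketch of a per-ordinal jump table; the PR's on-disk format may differ. */
final class OrdJumpTable {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] firstDoc; // firstDoc[ord] = first doc ID carrying that ordinal

    private OrdJumpTable(int[] firstDoc) {
        this.firstDoc = firstDoc;
    }

    /** Builds the table from per-document ordinals (assumed dense and non-decreasing). */
    static OrdJumpTable build(int[] ordByDoc, int valueCount) {
        int[] firstDoc = new int[valueCount];
        Arrays.fill(firstDoc, NO_MORE_DOCS);
        int prev = -1;
        for (int doc = 0; doc < ordByDoc.length; doc++) {
            int ord = ordByDoc[doc];
            if (ord < prev) {
                throw new IllegalArgumentException("index is not sorted by this field");
            }
            if (ord != prev) {
                firstDoc[ord] = doc; // a new run of equal ordinals starts here
                prev = ord;
            }
        }
        return new OrdJumpTable(firstDoc);
    }

    /** First doc whose ordinal is greater than currentOrd, without visiting docs in between. */
    int nextDocWithDifferentOrd(int currentOrd) {
        // Normally every ordinal is used, so this loop exits on its first iteration.
        for (int ord = currentOrd + 1; ord < firstDoc.length; ord++) {
            if (firstDoc[ord] != NO_MORE_DOCS) {
                return firstDoc[ord];
            }
        }
        return NO_MORE_DOCS;
    }

    public static void main(String[] args) {
        OrdJumpTable table = OrdJumpTable.build(new int[] {0, 0, 0, 1, 1, 2, 2, 2, 2}, 3);
        System.out.println(table.nextDocWithDifferentOrd(0)); // 3
        System.out.println(table.nextDocWithDifferentOrd(1)); // 5
    }
}
```

The table holds one entry per distinct term, so its size grows with cardinality rather than with the number of documents, which matches the low-cardinality condition described above.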

https://issues.apache.org/jira/browse/LUCENE-10396 discusses some of the use cases for this method, for example computing the number of unique values for documents that match a query. This approach diverges from the sparse index approach, but as it is less intrusive, it seems appealing.
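As an illustration of the unique-values use case, the sketch below counts distinct ordinals among the documents matching a query by jumping past each run of equal ordinals. It is not Lucene code: `DistinctOrdCounter` and its parameters are hypothetical, and it assumes a dense field in an index sorted by that field, with every ordinal used by at least one document.

```java
import java.util.Arrays;

/** Sketch: count distinct ordinals among matching docs by skipping runs of equal ordinals. */
final class DistinctOrdCounter {

    /**
     * @param matches       ascending doc IDs that match the query
     * @param ordByDoc      ordinal of every doc; non-decreasing because the index is sorted
     * @param firstDocOfOrd firstDocOfOrd[ord] = first doc ID carrying that ordinal
     */
    static int countDistinct(int[] matches, int[] ordByDoc, int[] firstDocOfOrd) {
        int count = 0;
        int i = 0;
        while (i < matches.length) {
            int ord = ordByDoc[matches[i]];
            count++; // first match seen for this ordinal
            if (ord + 1 >= firstDocOfOrd.length) {
                break; // already on the last ordinal, nothing new can follow
            }
            // Jump to the first doc of the next ordinal, then to the first match at or after it.
            int target = firstDocOfOrd[ord + 1];
            int pos = Arrays.binarySearch(matches, i + 1, matches.length, target);
            i = pos >= 0 ? pos : -pos - 1;
        }
        return count;
    }

    public static void main(String[] args) {
        int[] ordByDoc = {0, 0, 0, 1, 1, 2, 2, 2, 2};
        int[] firstDocOfOrd = {0, 3, 5};
        // Docs 1 and 2 carry ord 0, docs 6 and 8 carry ord 2.
        System.out.println(countDistinct(new int[] {1, 2, 6, 8}, ordByDoc, firstDocOfOrd)); // 2
    }
}
```

The binary search over the match list stands in for advancing the query's own iterator to the jump target; in a real collector that step would be an `advance` call on the matching-docs iterator.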

Note that in order to handle backwards compatibility, I have increased the version of the codec instead of creating a new one.

#11432

@iverase iverase marked this pull request as draft June 24, 2022 12:45
@iverase
Contributor Author

iverase commented Jun 28, 2022

I made a quick check of this patch by indexing 50 million documents into a sorted index. Each document contains just a SortedDocValues field with a 10-byte term. I measured the index size and the time to retrieve the first document per term at different cardinalities, and the results look like this:

Cardinality ~1000

                       | without patch       | with patch
Index Size (MB)        | 2.800084114074707   | 2.8039379119873047
average advanceOrd (s) | 0.39255053534999995 | 0.0011012437999999999

Cardinality ~10000

                       | without patch      | with patch
Index Size (MB)        | 16.125946044921875 | 16.164132118225098
average advanceOrd (s) | 0.52939177705      | 0.01008831655

Cardinality ~10000

                       | without patch      | with patch
Index Size (MB)        | 49.320682525634766 | 49.57721138000488
average advanceOrd (s) | 0.5479114709999999 | 0.03804306865

Cardinality ~50000

                       | without patch      | with patch
Index Size (MB)        | 52.81498718261719  | 53.66002082824707
average advanceOrd (s) | 0.6515335270999999 | 0.06898821255000001

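The new jump table is tiny compared to the size of the doc values (it adds less than 2% to the index size in every run above), while this new way of navigating is roughly one to two orders of magnitude faster (from about 9x at cardinality ~50000 to about 350x at cardinality ~1000).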

@gsmiller
Contributor

@iverase I was playing with this idea a little bit for a use-case I'm working on. It didn't pan out unfortunately, but in the process, I did take the time to rebase this change on the tip of main. Do you mind if I push the rebase to your PR branch since I did the work to rebase?

@iverase
Contributor Author

iverase commented Mar 30, 2023

Sure!
