Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up docvalues set query by making use of sortedness #12128

Merged
merged 2 commits into from
Feb 6, 2023

Conversation

rmuir
Copy link
Member

@rmuir rmuir commented Feb 4, 2023

LongHashSet is used for the set of numbers, but it has some issues:

  • tries too hard to extend AbstractSet, mostly for testing
  • causes traps with boxing if you aren't careful
  • complex hashcode/equals

Practically we should take advantage of the fact numbers come in sorted order for multivalued fields: just like range queries do. So we use min/max to our advantage, including termination of docvalues iteration.

Actually it is generally a win to just check min/max even in the single-valued case: these constant time comparisons are cheap and can avoid hashing, etc.

In the worst-case, if all of your query Sets contain both the minimum and maximum possible values, then it won't help, but it doesn't hurt either.

LongHashSet is used for the set of numbers, but it has some issues:
* tries to hard to extend AbstractSet, mostly for testing
* causes traps with boxing if you aren't careful
* complex hashcode/equals

Practically we should take advantage of the fact numbers come in sorted
order for multivalued fields: just like range queries do. So we use
min/max to our advantage, including termination of docvalues iteration

Actually it is generally a win to just check min/max even in the single-valued
case: these constant time comparisons are cheap and can avoid hashing,
etc.

In the worst-case, if all of your query Sets contain both the minimum and maximum
possible values, then it won't help, but it doesn't hurt either.
Copy link
Contributor

@gsmiller gsmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice min/max optimization. LGTM

@rmuir rmuir merged commit 10d9c74 into apache:main Feb 6, 2023
asfgit pushed a commit that referenced this pull request Feb 6, 2023
LongHashSet is used for the set of numbers, but it has some issues:
* tries to hard to extend AbstractSet, mostly for testing
* causes traps with boxing if you aren't careful
* complex hashcode/equals

Practically we should take advantage of the fact numbers come in sorted
order for multivalued fields: just like range queries do. So we use
min/max to our advantage, including termination of docvalues iteration

Actually it is generally a win to just check min/max even in the single-valued
case: these constant time comparisons are cheap and can avoid hashing,
etc.

In the worst-case, if all of your query Sets contain both the minimum and maximum
possible values, then it won't help, but it doesn't hurt either.
@rmuir rmuir added this to the 9.6.0 milestone Feb 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants