LUCENE-8968: Improve performance of WITHIN and DISJOINT queries for Shape queries #857
I don't think we should keep state in the visitor: the API doesn't guarantee that, when visit is called, all documents are within the rectangle that was passed to the previous compare call. This sounds like another reason to look into moving the API from a visitor to a cursor that could walk the tree index freely.
lucene/sandbox/src/java/org/apache/lucene/document/ShapeQuery.java
Thanks @jpountz, I agree we should not keep state in the visitor. I updated the PR to go back to two passes, but using only one dense bitset. In the case where
I have run the performance benchmark defined here, which uses ~13M polygons with a distribution similar to the luceneutil geo benchmarks. With this approach the results are better for WITHIN and DISJOINT. Performance is still poor for WITHIN or DISJOINT queries that match only a few documents, because they need to visit most of the documents.
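To make the "two passes, one dense bitset" idea concrete, here is a minimal sketch in plain Java. It is not the code in this PR: `TwoPassWithinSketch`, `Entry`, and both method names are hypothetical, shapes are simplified to 1D intervals, and linear scans stand in for the tree traversals. It only assumes that a document may own several shapes, which is why a second pass is needed to remove false positives.

```java
import java.util.BitSet;

public class TwoPassWithinSketch {

    /** Hypothetical index entry: one 1D interval owned by a document. */
    static final class Entry {
        final int docId;
        final int min;
        final int max;

        Entry(int docId, int min, int max) {
            this.docId = docId;
            this.min = min;
            this.max = max;
        }
    }

    /** Documents whose intervals ALL lie inside [qMin, qMax]. */
    static BitSet within(Entry[] entries, int maxDoc, int qMin, int qMax) {
        BitSet result = new BitSet(maxDoc);
        // Pass 1: mark documents with at least one interval inside the query.
        for (Entry e : entries) {
            if (e.min >= qMin && e.max <= qMax) {
                result.set(e.docId);
            }
        }
        // Pass 2: clear documents that also own an interval not inside the query.
        for (Entry e : entries) {
            if (e.min < qMin || e.max > qMax) {
                result.clear(e.docId);
            }
        }
        return result;
    }

    /** Documents with at least one interval, none of which touches [qMin, qMax]. */
    static BitSet disjoint(Entry[] entries, int maxDoc, int qMin, int qMax) {
        BitSet result = new BitSet(maxDoc);
        // Pass 1: mark every document that has a value at all.
        for (Entry e : entries) {
            result.set(e.docId);
        }
        // Pass 2: clear documents owning any interval that intersects the query.
        for (Entry e : entries) {
            if (e.max >= qMin && e.min <= qMax) {
                result.clear(e.docId);
            }
        }
        return result;
    }
}
```

Both methods reuse a single bitset across the two passes instead of keeping separate "matched" and "excluded" sets. The sketch also shows the cost noted above: even when few documents match, the second pass still has to look at most entries.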
I haven't looked at the tests. Do we already have good coverage for the sparse case?
if (hasAnyHits(query, values) == false) {
  // no hits so we can return
  return new ConstantScoreScorer(weight, boost, scoreMode, DocIdSetIterator.empty());
}
It'd be slightly better to handle this case in the scorer supplier and return a null scorer.
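The suggestion can be illustrated with a small stand-alone sketch. Everything here is a plain-Java stand-in rather than Lucene's API: `NullSupplierSketch`, the `Supplier<int[]>` return type, and this `hasAnyHits` signature are hypothetical, and each document is assumed to hold a single integer value. The point is only the control flow: the cheap "any hits?" check runs when the supplier is requested, and `null` tells the caller to skip the segment instead of building an empty scorer.

```java
import java.util.Arrays;
import java.util.function.Supplier;

public class NullSupplierSketch {

    /**
     * Hypothetical per-segment entry point. values[docId] is the single value
     * indexed for that document. Returns null when nothing can match.
     */
    static Supplier<int[]> scorerSupplier(int[] values, int qMin, int qMax) {
        if (!hasAnyHits(values, qMin, qMax)) {
            // No hits: signal "nothing to score" without allocating anything.
            return null;
        }
        // Otherwise defer the expensive collection until get() is called.
        return () -> {
            int[] hits = new int[values.length];
            int n = 0;
            for (int doc = 0; doc < values.length; doc++) {
                if (values[doc] >= qMin && values[doc] <= qMax) {
                    hits[n++] = doc;
                }
            }
            return Arrays.copyOf(hits, n);
        };
    }

    /** Cheap existence check that stops at the first matching value. */
    static boolean hasAnyHits(int[] values, int qMin, int qMax) {
        for (int v : values) {
            if (v >= qMin && v <= qMax) {
                return true;
            }
        }
        return false;
    }
}
```

A caller would check the returned supplier for `null` before calling `get()`, so segments without hits never pay for a scorer or an empty iterator.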
@jpountz thanks! Yes, the random test seems to hit all the possible combinations, so I think we are good there. I moved the check for the adversarial case to the scorer supplier so we can return a null scorer.
We are currently walking the tree twice for INTERSECTS and WITHIN queries in ShapeQuery when we can do it in just one pass. Most of the time we still need to visit all documents to remove false positives caused by multi-shapes, except when all documents up to maxDoc are in the tree.
This pull request refactors that class and tries to improve the strategy for such cases.