Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-8968: Improve performance of WITHIN and DISJOINT queries for Shape queries #857

Merged
merged 15 commits into from Sep 10, 2019

Conversation

iverase
Copy link
Contributor

@iverase iverase commented Sep 5, 2019

We are currently walking the tree twice for INTERSECTS and WITHIN queries in ShapeQuery when we can do it in just one pass. Still we need most of the times to visit all documents to remove false positives due to multi-shapes except in the case where all documents up to maxDoc are on the tree.
This pull request refactors that class and tries to improve the strategy for such cases.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should keep state in the visitor given that the API doesn't guarantee that when visit is called then all documents are within the rectangle that was passed in the previous compare call. This sounds like another reason to look into moving the API from a visitor to a cursor that could walk the tree index freely.

@iverase
Copy link
Contributor Author

iverase commented Sep 5, 2019

Thanks @jpountz , I agree we should not be keeping a state in the visitor. I updated the PR going back to two passes but using only one dense bitset. In the case where values.getDocCount() == reader.maxDoc() we only need one pass.

@iverase
Copy link
Contributor Author

iverase commented Sep 6, 2019

I have run the performance benchmark defined here which uses around ~13M polygons with a distribution similar to luceneutil geo benchmarks. The result with this approach is better for within and disjoint.

Still performance for WITHIN or DISJOINT queries that match only few documents is not good as it needs to visit most of the documents.

Shape Operation M hits/sec Dev M hits/sec Base M hits/sec Diff QPS Dev QPS Base QPS Diff Hit count Dev Hit count Base Hit count Diff
point within 0.00 0.00 0% 368.28 4.16 8759% 0 0 0%
box within 0.57 0.42 36% 3.89 2.86 36% 32911251 32911251 0%
poly 10 within 0.68 0.49 40% 2.61 1.87 40% 58873224 58873224 0%
polyMedium within 0.04 0.03 35% 2.52 1.86 35% 522739 522739 0%
polyRussia within 0.32 0.15 110% 1.32 0.63 110% 244661 244661 0%
point disjoint 236.15 43.13 448% 17.94 3.28 448% 2962178156 2962178156 0%
box disjoint 157.47 31.89 394% 12.10 2.45 394% 2929099536 2929099536 0%
poly 10 disjoint 75.69 22.01 244% 5.87 1.71 244% 2903116231 2903116231 0%
polyMedium disjoint 77.04 22.80 238% 5.86 1.73 238% 433924372 433924372 0%
polyRussia disjoint 18.74 8.87 111% 1.45 0.69 111% 12920400 12920400 0%
point intersects 0.00 0.00 -3% 362.28 372.58 -3% 2644 2644 0%
box intersects 4.63 4.69 -1% 31.47 31.92 -1% 33081264 33081264 0%
poly 10 intersects 2.05 2.13 -3% 7.83 8.11 -3% 59064569 59064569 0%
polyMedium intersects 0.14 0.13 4% 8.55 8.23 4% 528812 528812 0%
polyRussia intersects 0.37 0.37 0% 1.52 1.51 0% 244848 244848 0%

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at tests, do we already have good coverage for the sparse case?

if (hasAnyHits(query, values) == false) {
// no hits so we can return
return new ConstantScoreScorer(weight, boost, scoreMode, DocIdSetIterator.empty());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it'd be slightly better to handle this case in the scorer supplier to return a null scorer

# Conflicts:
#	lucene/sandbox/src/java/org/apache/lucene/document/ShapeQuery.java
@iverase
Copy link
Contributor Author

iverase commented Sep 10, 2019

@jpountz thanks! Yes, the random test seems to hit all the possible combinations so I think we are good there. I moved the check for adversarial case to the scorer supplier so we can return a null scorer.

@iverase iverase merged commit de423ae into apache:master Sep 10, 2019
asfgit pushed a commit that referenced this pull request Sep 10, 2019
@iverase iverase deleted the queryShape branch September 10, 2019 12:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants