LUCENE-8968: Improve performance of WITHIN and DISJOINT queries for Shape queries #857

iverase · 2019-09-05T13:11:29Z

We are currently walking the tree twice for INTERSECTS and WITHIN queries in ShapeQuery when we can do it in just one pass. Most of the time we still need to visit all documents to remove false positives caused by multi-shapes, except when all documents up to maxDoc are in the tree.
This pull request refactors that class and tries to improve the strategy for such cases.

jpountz

I don't think we should keep state in the visitor, given that the API doesn't guarantee that, when visit is called, all documents are within the rectangle that was passed to the previous compare call. This sounds like another reason to look into moving the API from a visitor to a cursor that could walk the tree index freely.

lucene/sandbox/src/java/org/apache/lucene/document/ShapeQuery.java

iverase · 2019-09-05T15:30:42Z

Thanks @jpountz, I agree we should not keep state in the visitor. I updated the PR to go back to two passes, but using only one dense bitset. In the case where values.getDocCount() == reader.maxDoc() we only need one pass.
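A minimal sketch of the idea in the comment above, written in plain Java rather than taken from the PR. It assumes a simplified model in which every indexed shape is reduced to a pair (owning doc id, whether that single shape satisfies the relation); `DenseBitSetSketch`, `matchingDocs`, and its parameters are hypothetical names, and `java.util.BitSet` stands in for the dense bitset.

```java
import java.util.BitSet;

/** Simplified model: shape i belongs to docIds[i]; matches[i] says whether it satisfies the relation. */
final class DenseBitSetSketch {

    /**
     * Returns the documents for which ALL of their shapes satisfy the relation.
     * docCount stands in for values.getDocCount() (documents with at least one shape).
     */
    static BitSet matchingDocs(int[] docIds, boolean[] matches, int docCount, int maxDoc) {
        BitSet result = new BitSet(maxDoc);
        if (docCount == maxDoc) {
            // Every document has at least one shape, so all documents are
            // potential matches and the collecting pass can be skipped.
            result.set(0, maxDoc);
        } else {
            // Pass 1: collect documents that have at least one matching shape.
            for (int i = 0; i < docIds.length; i++) {
                if (matches[i]) {
                    result.set(docIds[i]);
                }
            }
        }
        // Final pass: remove false positives, i.e. multi-shape documents
        // where some other shape does not satisfy the relation.
        for (int i = 0; i < docIds.length; i++) {
            if (!matches[i]) {
                result.clear(docIds[i]);
            }
        }
        return result;
    }
}
```

Both branches reuse the same bitset; the only difference is whether the candidates come from a pass over the shapes or from setting every bit up front.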

iverase · 2019-09-06T08:49:28Z

I have run the performance benchmark defined here, which uses about 13M polygons with a distribution similar to the luceneutil geo benchmarks. The results with this approach are better for WITHIN and DISJOINT.

Performance for WITHIN or DISJOINT queries that match only a few documents is still not good, as they need to visit most of the documents.

| Shape | Operation | M hits/sec Dev | M hits/sec Base | M hits/sec Diff | QPS Dev | QPS Base | QPS Diff | Hit count Dev | Hit count Base | Hit count Diff |
|---|---|---|---|---|---|---|---|---|---|---|
| point | within | 0.00 | 0.00 | 0% | 368.28 | 4.16 | 8759% | 0 | 0 | 0% |
| box | within | 0.57 | 0.42 | 36% | 3.89 | 2.86 | 36% | 32911251 | 32911251 | 0% |
| poly 10 | within | 0.68 | 0.49 | 40% | 2.61 | 1.87 | 40% | 58873224 | 58873224 | 0% |
| polyMedium | within | 0.04 | 0.03 | 35% | 2.52 | 1.86 | 35% | 522739 | 522739 | 0% |
| polyRussia | within | 0.32 | 0.15 | 110% | 1.32 | 0.63 | 110% | 244661 | 244661 | 0% |
| point | disjoint | 236.15 | 43.13 | 448% | 17.94 | 3.28 | 448% | 2962178156 | 2962178156 | 0% |
| box | disjoint | 157.47 | 31.89 | 394% | 12.10 | 2.45 | 394% | 2929099536 | 2929099536 | 0% |
| poly 10 | disjoint | 75.69 | 22.01 | 244% | 5.87 | 1.71 | 244% | 2903116231 | 2903116231 | 0% |
| polyMedium | disjoint | 77.04 | 22.80 | 238% | 5.86 | 1.73 | 238% | 433924372 | 433924372 | 0% |
| polyRussia | disjoint | 18.74 | 8.87 | 111% | 1.45 | 0.69 | 111% | 12920400 | 12920400 | 0% |
| point | intersects | 0.00 | 0.00 | -3% | 362.28 | 372.58 | -3% | 2644 | 2644 | 0% |
| box | intersects | 4.63 | 4.69 | -1% | 31.47 | 31.92 | -1% | 33081264 | 33081264 | 0% |
| poly 10 | intersects | 2.05 | 2.13 | -3% | 7.83 | 8.11 | -3% | 59064569 | 59064569 | 0% |
| polyMedium | intersects | 0.14 | 0.13 | 4% | 8.55 | 8.23 | 4% | 528812 | 528812 | 0% |
| polyRussia | intersects | 0.37 | 0.37 | 0% | 1.52 | 1.51 | 0% | 244848 | 244848 | 0% |

jpountz

I haven't looked at the tests. Do we already have good coverage for the sparse case?

jpountz · 2019-09-09T20:33:50Z

lucene/sandbox/src/java/org/apache/lucene/document/ShapeQuery.java

+        if (hasAnyHits(query, values) == false) {
+          // no hits so we can return
+          return new ConstantScoreScorer(weight, boost, scoreMode, DocIdSetIterator.empty());
+        }


It'd be slightly better to handle this case in the scorer supplier and return a null scorer.
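A rough sketch of what this suggestion could look like, using hypothetical stand-in types rather than Lucene's `ScorerSupplier` and `Scorer`. The assumption is that the caller treats a null supplier as "no document in this segment matches", so no scorer or empty iterator has to be built; `NullSupplierSketch`, `SketchScorer`, and the `hasAnyHits` flag are illustrative names only.

```java
import java.util.function.Supplier;

final class NullSupplierSketch {

    /** Hypothetical stand-in for a scorer: just the matching doc ids. */
    static final class SketchScorer {
        final int[] docs;

        SketchScorer(int[] docs) {
            this.docs = docs;
        }
    }

    /**
     * Returns null when the cheap pre-check found no hits, instead of
     * returning a supplier that would later build an empty scorer.
     */
    static Supplier<SketchScorer> scorerSupplier(boolean hasAnyHits, int[] candidateDocs) {
        if (!hasAnyHits) {
            // Nothing can match: signal it to the caller with null.
            return null;
        }
        // Scorer construction is deferred until the caller asks for it.
        return () -> new SketchScorer(candidateDocs);
    }
}
```

A caller would check the returned supplier for null before calling `get()`.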

iverase · 2019-09-10T10:49:38Z

@jpountz thanks! Yes, the random test seems to hit all the possible combinations, so I think we are good there. I moved the check for the adversarial case to the scorer supplier so we can return a null scorer.

iverase added 2 commits September 5, 2019 15:00

Refactor ShapeQuery and improve approach to speed up within and disjoint queries

bc4f2c5

Some formatting nits

a71ebb1

jpountz reviewed Sep 5, 2019

iverase added 2 commits September 5, 2019 16:22

address first comments

47d4f29

address comments

6de7570

iverase added 3 commits September 5, 2019 17:49

Add TODO

6743d69

New approach handling adversarial case

a24ac62

Add assert

5271075

iverase added 4 commits September 6, 2019 10:58

rename interesects -> sparse

4158eba

simplify further

bd6cbae

docs

b462175

remove unused imports

69add2d

jpountz approved these changes Sep 9, 2019

iverase added 3 commits September 10, 2019 12:14

Merge branch 'master' into queryShape

8dab4f1

# Conflicts: # lucene/sandbox/src/java/org/apache/lucene/document/ShapeQuery.java

Check for adversarial case on case of dense case in the scorer supplier.

25d7a88

Add entry in changes.txt

7b93807

Merge branch 'master' into queryShape

b6bd5ff

iverase merged commit de423ae into apache:master Sep 10, 2019

asfgit pushed a commit that referenced this pull request Sep 10, 2019

LUCENE-8968: Improve performance of WITHIN and DISJOINT queries for Shape queries (#857)

5350a05

iverase deleted the queryShape branch September 10, 2019 12:59
