Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GeoShape intersects operation returns false positives #2361

Closed
IamJeffG opened this issue Oct 27, 2012 · 18 comments · Fixed by #10652
Closed

GeoShape intersects operation returns false positives #2361

IamJeffG opened this issue Oct 27, 2012 · 18 comments · Fixed by #10652
Labels
:Analytics/Geo Indexing, search aggregations of geo points and shapes >enhancement help wanted adoptme

Comments

@IamJeffG
Copy link
Contributor

The geoShape logic produces false positives in results returned from "intersects" queries and filters. I presume (have not tested) that "contains" also produces false positives and "disjoint" can fail to return true matches. This is because the queries/filters check only if the query shape and document shape have grid locations that overlap in the SpatialPrefixTree implementation, without actually comparing the query shape with the document shape.

In theory this bug can be worked around by increasing the depth of the SpatialPrefixTree (maxLevels in the constructor). In practice both indexing shapes and querying shapes is a O(4^n) algorithm in both space and time, where n is the tree's depth, so highly accurate results and low-latency or limited RAM are mutually exclusive. 100% accurate querying at all resolutions requires an infinitely deep tree, under the current implementation, which is based solely on gridding.

One solution is to use the existing implementation to find candidate docs, and verify each of those docs by pulling back its geometry and comparing to the query using JTS. This allows keeping spatial index trees relatively shallow and efficient. I'm happy to implement this and share the pull request, unless you guys have other ideas. I have questions:

  • does such a patch go to the elasticsearch project, or upstream (Lucene?)
  • documentation says we don't support storage of geometry fields, except in the _source. Where would I start to store the geometry (maybe as WKB) itself in the index (not just its tree nodes)?

Here's how to replicate the bug:

   private static final SpatialPrefixTree QUAD_PREFIX_TREE =
            new QuadPrefixTree(GeoShapeConstants.SPATIAL_CONTEXT,
            QuadPrefixTree.DEFAULT_MAX_LEVELS);
    strategy  = new TermQueryPrefixTreeStrategy(
            new FieldMapper.Names("shape"), QUAD_PREFIX_TREE, 0.0);

    IndexWriter writer = new IndexWriter(directory,
            new IndexWriterConfig(Version.LUCENE_36, new KeywordAnalyzer()));
    writer.addDocument(newDocument("5",
                newPolygon().point(-92.34, 37.56).point(-92.34, 37.55)
                .point(-92.32, 37.55).point(-92.32, 37.56).point(-92.34, 37.56).build()));

    Point point = new Point(-92.319, 37.54);
    Filter filter = STRATEGY.createIntersectsFilter(point);

    // BUG: the doc and query do not intersect, but this returns 1 hit
    indexSearcher.search(new MatchAllDocsQuery(), filter, 10);
@chrismale
Copy link
Contributor

So the concern here is that the algorithms are not precise enough because they rely totally on the grid hash overlap?

This is actually partially by design. Iterating over a large set of documents and comparing them using JTS will heavily degrade performance. The current behaviour is similar to the effect that stemming has on free text searches. There are countless examples where stemming results in documents being matched that don't really line up with the user's intentions.

With that said, we are considering bringing in an alternative algorithm based on Lucene's RecursivePrefixTreeStrategy which theoretically provides a little more precision at the edges of shapes, but it doesn't exceed the SpatialPrefixTree levels.

Have you done any benchmarking of the performance of using JTS? I have found even just the object creation is expensive. If we can find a way to make the performance manageable, I can envisage maybe having an option to enable the precise filtering which is disabled by default.

does such a patch go to the elasticsearch project, or upstream (Lucene?)

ES uses its own version of the Lucene spatial code so we can explore the problem here and then consider contributing fixes upstream. However due to the licensing of JTS, Lucene cannot directly use it.

documentation says we don't support storage of geometry fields, except in the _source. Where would I start to store the geometry (maybe as WKB) itself in the index (not just its tree nodes)?

Good question. I have worked on a standalone WKT parser which we could use but WKB would probably be more compact. Given a parser, we would need to change the Mapping code fro GeoShape to add an additional field which would index the parsed shape. Something along these lines is already done in the GeoPoint codebase.

@IamJeffG
Copy link
Contributor Author

Chris,
You make a good point that the false-positives are analogous to stemming in free text, but just as elasticsearch also supports exact matches (via non-analyzed term query) when desired, I'm proposing it be able to do the same for geo-shape. it's true there will be a performance hit, possibly severe, so we should make this extra validation optional.

That said, I benchmark WKB deserialization to be extremely fast: 3% of GeoJSON deser time and 9% of WKT deser time, using the JTS libraries. Once instantiated, geometry comparisons can be made 58% faster when using a JTS PreparedGeometry.

I'm going to attempt a patch; I'm new to the codebase but here's what I'm thinking:
At index time, GeoShapeFieldMapper needs to use a MultiFieldMapper (or similar), indexing both the prefix tree terms (as now), plus a BinaryMapper for the WKB, which has storage enabled. It seems the FieldData and DocFieldData interfaces are used only at query time, and they would provide the stored WKB back to a GeoShapeFilter for polygon comparison, is this right?

Thanks for getting back!

@chrismale
Copy link
Contributor

You make a good point that the false-positives are analogous to stemming in free text, but just as elasticsearch also supports exact matches (via non-analyzed term query) when desired, I'm proposing it be able to do the same for geo-shape. it's true there will be a performance hit, possibly severe, so we should make this extra validation optional.

I appreciate the argument, just the severity concerns me.

That said, I benchmark WKB deserialization to be extremely fast: 3% of GeoJSON deser time and 9% of WKT deser time, using the JTS libraries. Once instantiated, geometry comparisons can be made 58% faster when using a JTS PreparedGeometry.

Are you able to benchmark the comparison of 10,000 Geometrys? Just to give me an idea of the severity.

I'm going to attempt a patch; I'm new to the codebase but here's what I'm thinking:
At index time, GeoShapeFieldMapper needs to use a MultiFieldMapper (or similar), indexing both the prefix tree terms (as now), plus a BinaryMapper for the WKB, which has storage enabled. It seems the FieldData and DocFieldData interfaces are used only at query time, and they would provide the stored WKB back to a GeoShapeFilter for polygon comparison, is this right?

I would take a look at GeoPointFieldMapper as an example. It creates a number of different fields. Equally, looking at GeoPolygonFilterParser will give an idea of how to use FieldData (the FilterParser uses a Filter which interacts with FieldData).

@dsmiley
Copy link
Contributor

dsmiley commented Oct 31, 2012

Interesting conversation.

Lucene spatial's RecursivePrefixTreeStrategy (RPTS) would be great here. It could be extended such that at the finder detail levels of intersection at the edges, it then consults the JTS shape for an exact hit. This works well because this won't even happen for shapes that are clearly outside or clearly inside the query shape -- it's only at the intersecting edges (or very close to intersecting) where this would happen. I don't have this requirement on my projects so I haven't bothered to do it. RPTS will also find matches where the query shape is inside much larger indexed shapes -- a limitation to ES's spatial.

@dsmiley
Copy link
Contributor

dsmiley commented Oct 31, 2012

I was thinking about this a bit more and I can see that with the combination of our grid system we can do even better when paired with JTS. Instead of verifying intersection (or lack thereof) with JTS at the edges, it could be consulted only when the shapes come closed but indeterminate wether there is overlap. If there is clear overlap (i.e. the indexed shape has grids inside and outside the query shape) we already know there is an intersection and need not consult JTS for those shapes. I suspect an algorithm like this would improve massively on any other approaches, as this would forgoe the need to consult JTS in I think the majority of cases, and in those cases limit the # shapes to look at to a small #.

@IamJeffG
Copy link
Contributor Author

Thanks for the ideas David. Sounds like a good plan for efficiency. I've also noticed that in ES queries with very large spatial extent or indexing a doc with a very large spatial extent is quite infeasible due to ES's prefix trees indexing every single grid under the shape; the hack is to use a more shallow tree, but this is untenable when different docs have vastly different extents. I wonder if RPTS would solve this too (but that's for another bug report).

Meantime, where can an overview of RPTS be found? All I am finding with Google is the source code.

@dsmiley
Copy link
Contributor

dsmiley commented Oct 31, 2012

The source code for the Lucene filter is where the search algorithm is, and that's what is pertinent here: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/spatial/src/java/org/apache/lucene/spatial/prefix/RecursivePrefixTreeFilter.java?view=markup

There isn't much code to it, although it's complex. Arguably the most complex aspect of the code is that it switches its algorithm from recursive decomposition to brute-force scan once the detail level gets high enough.

I've been putting a little thought into this filter and I can see it getting overhauled a bit so that it can do contains, within, disjoint, (not just intersects), plus also supporting a callback or related mechanism to handle ambiguous intersection, which is where you could consult JTS in a 2nd phase.

RPTS and ES's term query based search both inherit the same indexing approach; there's no difference there. With the types of changes we're discussing here about using JTS as a fallback for high precision when there is ambiguity, it would not be necessary to index high-precision grids; I'd do course, which means more scalable indexing, with the trade-off of higher likelihood of consulting a serialized shape at search time.

@chrismale
Copy link
Contributor

Thanks for providing your insight David.

Couple of thoughts:

  • We will definitely bring RecurisvePrefix into ES as an option, probably as part of 0.21. It definitely has the potential to help improve the accuracy along the edges of shapes.
  • I remain really unsure about consulting JTS, even if it's now with a smaller % of shapes. I worry about the memory consumption and the impact on query performance. I think we really need to do some benchmarking and get an idea of what impact this is going to have on search time.

@IamJeffG
Copy link
Contributor Author

IamJeffG commented Nov 1, 2012

I benchmarked intersects queries from a single reference shape (as a JTS PreparedGeometry for efficiency) against 10,000 other Geometrys. My laptop can do this right around 15ms. This is if Geometry instances are already instantiated.

To also include object instantiation, deserializing from WKB, the total time for 10,000 instantiations and comparisons is 35-45ms.

These are not stupid simple shapes either: I was using mostly MultiPolygons with mean 29 vertices per Geometry.

@chrismale
Copy link
Contributor

That sounds very manageable. Do you know how much memory say, 100,000 Geometrys consumes?

@IamJeffG
Copy link
Contributor Author

I think the memory consumption of 100k Geometrys will vary wildly depending on the complexity of the shape. In most cases, though, I don't think it's practical to expect to store or cache them all in memory. There will probably be a disk hit to do the extra validation, but isn't there already a disk hit to read the indexed geohash tiles? By chance can we pull back the WKB from the index at the same time?

Is it true that FieldData is the appropriate (or only?) method to access the BinaryFieldMapper data I store in the index? I ask because http://www.elasticsearch.org/guide/reference/index-modules/cache.html says FieldData is used for sorting or caching, but doesn't mention filtering like I'm doing here.

If so, is the MultiValueByteFieldData an efficient way to retrieve the stored byte array from the index, or is it better to implement a custom GeoShapeFieldData, similar to the extant GeoPointFieldData? (and why?)

@dsmiley
Copy link
Contributor

dsmiley commented Nov 27, 2012

RE "but isn't there already a disk hit to read the indexed geohash tiles?"

That's apples & oranges. A geohash is a simple shape with a fairly compact representation (but could be improved). The number of geohash tiles that will be examined has a roughly fixed upper bound given the distErrPct (shape approximation statistic). The disk seek()'s are strictly increasing across the shape, and the geohashes will tend to be co-located and thus increasing the effectiveness of the OS's disk cache. On the other hand, looking up shape data per-document is on the order of the number of affected documents (could be a small or big number), and it is a random disk seek per document that is unlikely to be co-located with nearby shapes. That last point is solvable using sorted DocValues and a geohash centroid prefix to the bytes.

@IamJeffG
Copy link
Contributor Author

Makes perfect sense, thanks David. Chris, can you confirm I'm on the right track around FieldData? I'm following the GeoPolygonFilterParser, but I think what it does is overkill for our needs around this issue.

@IamJeffG
Copy link
Contributor Author

Followup after playing with the code a bit:
It seems that a FieldDataCache cannot read stored-but-not-indexed fields, such as are produced by the BinaryFieldMapper [1]. What's the best way around this?
I can hack around this and make further progress by using a StringFieldMapper for a base-64-encoded WKB, but I hate to have that extra overhead in both encoding time and storage space.

[1] http://www.elasticsearch.org/guide/reference/mapping/core-types.html "Binary types: The field is always stored and not indexed at all."


Update: the limitation that BinaryFieldMappers were not indexed has been removed by 549900a. However this merely defers the problem to instantiation of the Field instance:

java.lang.IllegalArgumentException: Fields with BytesRef values cannot be indexed
           at org.apache.lucene.document.Field.<init>(Field.java:222)

IamJeffG pushed a commit to TheClimateCorporation/elasticsearch that referenced this issue Nov 30, 2012
 - requisite for issue elastic#2361
 - uses Base64 encoding with a StringFieldMapper: want to use a BinaryFieldMapper
   instead, but get
   java.lang.IllegalArgumentException: Fields with BytesRef values cannot be indexed
       at org.apache.lucene.document.Field.<init>(Field.java:222)
IamJeffG pushed a commit to TheClimateCorporation/elasticsearch that referenced this issue Dec 3, 2012
@IamJeffG
Copy link
Contributor Author

IamJeffG commented Dec 3, 2012

Sorry for all the stale commits up above. Take a look at pull request 2460. I have two doubts about the efficiency of its implementation, so please help me address these. I'd prefer we don't merge 2460, but I'm happy to iterate based on our conversation and followup with a new pull request.

@clintongormley
Copy link

According to #2803 (comment) this issue should be fixed by Lucene 4.7's https://issues.apache.org/jira/browse/LUCENE-5408. What do we need to do to benefit from that?

@IamJeffG
Copy link
Contributor Author

IamJeffG commented Jul 8, 2014

From a long-ago discussion with @dsmiley, I believe this requires the following two changes in Elasticsearch:

  1. Index time: Mapping for geo_shape type objects should take boolean flag whether this field should support verified matches. If so, we write the indexableFields from both the PrefixTreeStrategy and the SerializedDVStrategy. The latter serializes the doc's geometry in a BinaryDocValuesField (the DocValues should be disk-resident since values might be large).
  2. A geo_shape query/filter, if "verified matches" are desired, will explicitly issue two filters/queries to Lucene:
    a) The PrefixTreeStrategy as we currently do, though perhaps with less precision (as a cachable filter if possible).
    b) In the same request, use the new SerializedDVStrategy to do exact polygon verification between the query shape and the DocValues of the remaining docs.

The SerializedDVStrategy itself has no index, it's O(N) for N documents: it should always be used in conjunction with another filter, specifically after the PrefixTreeStrategy. Since it's slow, we'd also like to force it dead-last in the query tree; for example Solr has a mechanism called "PostFilter" to do this, I'm not sure what Elasticsearch provides.

@dsmiley
Copy link
Contributor

dsmiley commented Jul 10, 2014

I think there are two approaches to do this that are philosophically a little different, but they don't have to be mutually exclusive:

(A) Augment geo_shape with this new capability. In this concept, one mapping/type supports everything, with some configuration flags to turn unneeded things off. Heck, it might eventually incorporate BBoxSpatialStrategy (newly committed yesterday) too. One mapping/type but underlying it would be potentially multiple "SpatialStrategies" in use and multiple fields. This would probably be the most user-friendly because the user is less exposed to some nuts & bolts, but it hides some underlying things that may make it harder to configure each strategy a little differently. There is an optimization easily done here too in which you can easily ensure that the underlying polygon is parsed once, since it's managed by one mapper (index time) and query.

Ideally, in this approach, the GeoShapeFilterParser would be able to somehow modify the overall query to use FilteredQuery with QUERY_FIRST_FILTER_STRATEGY with the slow SerializedDVStrategy providing the (slow) filter. If this can't be done or is too awkward, then it may have to settle for only wrapping the spatial PrefixTree (RPT) query instead of all other queries in play (e.g. keyword & other filters), which means it'll definitely be less efficient. If none of this is done then the user must provide a query with multiple spatial clauses (the RPT early and SerializedDV last), and if so I argue the one-mapper to rule them all abstraction breaks down.

(B) Enhance GeoShapeFieldMapper to support other SpatialStrategies (only one instance). The user would then use two spatial fields, one for the RPT index, one for serialized geometry. This is definitely an easier approach and should offer decent control by the user as to how to compose the query, but it requires the user to know what they are doing more. And ideally ES has mechanisms to share a parsed Shape between fields on indexing and search so that both Mappers don't have to re-construct the Shapes from JSON.

@clintongormley clintongormley added :Analytics/Geo Indexing, search aggregations of geo points and shapes and removed good first issue low hanging fruit labels Nov 10, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Geo Indexing, search aggregations of geo points and shapes >enhancement help wanted adoptme
Projects
None yet
4 participants