Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GEO] Add new geo_bounding_box field type #27902

Closed
wants to merge 12 commits into from

Conversation

nknize
Copy link
Contributor

@nknize nknize commented Dec 19, 2017

This PR adds a new GeoBoundingBoxFieldMapper to expose lucene's LatLonBoundingBoxField type. Creating a new bounding box field is done by using the new field type mapping: geo_bounding_box. The field currently takes no options, though coerce and ignore_malformed options could be added later. Currently the only supported query is the GeoBoundingBoxQueryBuilder enabling users to query bounding boxes with other bounding boxes. Next steps include adding support for GeoPolygonQueryBuilder which would allow users to query bounding boxes with other random polygons.

Current Features

  • geo_bounding_box field mapper does not support indexing boxes that cross the dateline
  • geo_bounding_box query builder does support querying with boxes that cross the dateline (includes boxes that touch the 180 meridian)
  • All INTERSECT, WITHIN, CONTAINS, DISJOINT relations supported except CONTAINS with query boxes that cross the dateline (since dateline crossing boxes are not indexed)

Todo

  • Update docs

@nknize nknize added :Analytics/Geo Indexing, search aggregations of geo points and shapes :Search/Mapping Index mappings, including merging and defining field types :Query DSL v6.3.0 WIP v6.2.0 and removed v6.3.0 labels Dec 19, 2017
@nknize nknize requested a review from jpountz December 19, 2017 16:36
@nknize
Copy link
Contributor Author

nknize commented Dec 19, 2017

@jpountz This still has dateline crossing. Wanted you to have a look before I completely take it out to get your thoughts on whether its as bad as I think. If we can keep dateline crossing out of lucene and achieve it here without any serious performance issues it might be worth it?

@nknize
Copy link
Contributor Author

nknize commented Dec 20, 2017

@elasticmachine please test this

@nknize nknize added the review label Dec 21, 2017
@jpountz
Copy link
Contributor

jpountz commented Dec 22, 2017

Wanted you to have a look before I completely take it out to get your thoughts on whether its as bad as I think.

Looking at the code, I have two concerns. One is that we might create slow queries sometimes due to MUST_NOT clauses that match many documents that also match the SHOULD or MUST clauses. The other one which might be a bit more serious is that I suspect this is either incorrect or potentially confusing in case of multi-valued fields. For instance say we have two boxes that cross the dateline for the same field of the same document, Lucene doesn't allow us to know which western box maps to which eastern box?

@jpountz
Copy link
Contributor

jpountz commented Dec 22, 2017

Another (unrelated) thing that caught my attention is that the PR as-is needs to perform instanceof checks in GeoBoundingBoxQueryBuilder in order to know whether the field is a point or a range. Maybe we should reuse the MappedFieldType.rangeQuery API by passing the topleft point as a from and the bottom right point as a to so that this would be handled at the MappedFieldType level. I could work on that part if you want.

@nknize
Copy link
Contributor Author

nknize commented Jan 8, 2018

I suspect this is either incorrect or potentially confusing in case of multi-valued fields.

I just added a wrap_dateline parameter that puts this control in the hands of the user. If set to true, then multivalue boxes will fail. If set to false (the default) multivalue boxes will pass but a document will fail to index if the user tries to index any boxes that cross the dateline. This way we can support both multivalue bounding boxes for the non-dateline crossing use case, but single value only for more complex boxes that cross the dateline. Let me know what you think about putting this level of control in the hands of the user. I think it strikes a nice balance? And a user should be able to use index aliases to achieve both usecases from a single alias?

we might create slow queries sometimes due to MUST_NOT clauses that match many documents that also match the SHOULD or MUST clauses

Haven't looked at a solution to this problem yet, but I think the wrap_dateline option might help here? We can add a performance warning/tip to the docs that recommend enabling the wrap_dateline parameter for indexes that only index boxes that wrap the dateline and non-dateline wrapped boxes should be indexed in a separate field that does not support dateline wrapping.

Maybe we should reuse the MappedFieldType.rangeQuery API

👍 Will add this next

@nknize nknize force-pushed the feature/GeoBoundingBoxFieldMapper branch from a2acada to be57cd2 Compare January 9, 2018 00:04
@jpountz
Copy link
Contributor

jpountz commented Jan 10, 2018

I'm not a fan of the wrap_dateline parameter, we don't do anything similar with any other field. I'd rather like that we only support intersection or boxes that do not cross the dateline.

My preference would be to reject boxes that cross the dateline and have all relations implemented in a fast way, so that it would be complementary of the upcoming geo_shape field, which will already give you intersection support for bounding boxes that cross the dateline?

@nknize
Copy link
Contributor Author

nknize commented Jan 10, 2018

My preference would be to reject boxes that cross the dateline and have all relations implemented in a fast way, so that it would be complementary of the upcoming geo_shape field...

I am not completely opposed that, but unfortunately we face the same multivalue and dateline issue with geo_shape since we split the shape into two distinct shapes at the dateline. The correct way to implement this is to modify Lucene's BKD codec so that it supports bounding boxes as the key (effectively turning it into an R-Tree). While the wrap_dateline approach is not ideal, it does provide a simple nearterm solution to support dateline wrapping with full relational query capabilities until the BKD codec improvements could be added.

we don't do anything similar with any other field.

I sort of viewed this as a parallel to the problem with phrase queries over multivalue text fields. There we use a position_increment_gap parameter to prevent phrase queries from matching across values. While the wrap_dateline parameter prevents multivalue for dateline crossing altogether, perhaps there is a solution that is more in line with the position_increment_gap concept; or maybe implementing something in Lucene like TermVector.WITH_POSITIONS_OFFSETS for multivalue shapes/boxes that cross the dateline? This may be a more elegant solution than modifying the entire BKD tree to support bounding boxes? I'm brainstorming now, thoughts?

@jpountz
Copy link
Contributor

jpountz commented Jan 11, 2018

If we want to support all relations, I think we need to rework the encoding. Trying to hack it on top of the existing LatLonBoundingBox, which only supports INTERSECTS is going to be painful in the long term I'm afraid? For instance we discussed doubling the encoding space for the longitude at some point, I think this is one option? From what I remember we mostly didn't do it in order to have the same encoding as LatLonPoint.

Some questions that this discussion triggered for me:

  • If the upcoming geo_shape/geo field supports the same query capabilities as geo_bounding_box, should we add geo_bounding_box or just wait for shape support?
  • How likely are users to need relations other than INTERSECTS?

I'm pretty open regarding where we are moving this forward, however one thing that I care about is that we do end up with exposing some features that perform linear scans under the hood.

This commit adds a new GeoBoundingBoxFieldMapper to expose lucene's LatLonBoundingBoxField type. Creating a new bounding box field is done by using the new field type mapping: `geo_bounding_box`. The field currently takes no options, though a coerce and ignore_malformed options could be added later. Currently the only supported query is the GeoBoundingBoxQueryBuilder enabling users to query bounding boxes with other bounding boxes. Next steps include adding support for GeoPolygonQueryBuilder which would allow users to query bounding boxes with other random polygons.
…ssing boxes, or single value for dateline crossing boxes
@nknize nknize force-pushed the feature/GeoBoundingBoxFieldMapper branch from 4dc836a to cc11a9f Compare January 12, 2018 02:59
@nknize nknize removed the WIP label Jan 12, 2018
@nknize
Copy link
Contributor Author

nknize commented Jan 12, 2018

@jpountz I spoke w/ @clintongormley about this with respect to the roadmap. Per all discussions I've gone ahead and updated the PR as follows:

  • removed wrap_dateline parameter
  • do not support indexing bounding boxes that cross the dateline
  • support indexing multi-value bounding boxes
  • support querying with boxes that wrap the dateline (e.g., INTERSECTS a query box that crosses the east/west hemisphere)
  • support full relation queries except CONTAINS with query boxes that wrap the dateline (since we no longer support indexing boxes that cross the dateline this query would return an empty set)

Aside from updating docs this PR is ready for another review to identify and clean up any gross or ill performing code.

@nknize
Copy link
Contributor Author

nknize commented Jan 15, 2018

@clintongormley We talked about this at the meeting this morning. The big question is whether or not geo_bounding_box is something that is worth it if there's a possibility it will be deprecated and removed if/when BKD is extended to support an R-Tree range structure. I have an open Lucene issue that I bumped for some discussion.

After thinking about this a little more: I can see the concern for adding a new bounding box field, only to remove it in a future version. That being said, I'm not sure what kind of response I'm going to get from the lucene community and/or how long it will take to settle on a design. So is this PR worth it as an interim stop-gap? I think @jpountz feels its not? And I don't know that anyone other than our hardcore GIS users would be upset if it doesn't land in 6.2. So it just boils down to answering whether or not its okay to add a new field if there's a possibility it will be removed in the future. Is there a precedence for something like this?

@clintongormley what are your thoughts? I'm okay postponing this until we can decide the future of a spatial codec in Lucene. That will tell us whether or not this field will be deprecated/removed in a future release and we can decide the fate of geo_bounding_box at that point.

@colings86 colings86 added v6.3.0 and removed v6.2.0 labels Jan 22, 2018
@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Query DSL labels Feb 14, 2018
@nknize
Copy link
Contributor Author

nknize commented Mar 26, 2018

Closing since we decided against a geo_bounding_box field in favor of BKD backed geo_shape.

@nknize nknize closed this Mar 26, 2018
@jpountz jpountz removed the v6.3.0 label Jun 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Geo Indexing, search aggregations of geo points and shapes :Search/Mapping Index mappings, including merging and defining field types :Search/Search Search-related issues that do not fall into other categories
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants