Add aggregation support for geo_shape fields #50834
This PR introduces doc-values support for
feature-branch PRs still in-flight or not yet up that are blocking this PR:
TODOs that are not blocking this PR:
This commit introduces a new data structure for writing EdgeTrees in serialized form and reading them back. The EdgeTree is the basis of the Polygon trees, which will also represent any holes in more complex polygons.
The GeometryTree represents an Elasticsearch Geometry object. This includes collections like MultiPoint and GeometryCollection. For the initial implementation, only polygons without holes are supported. In a follow-up PR, the GeometryTree will be the object that interacts with doc-value reading and writing.
- min and max values of coordinates were difficult to track, this fixes that by introducing a new Extent object
- Instead of re-wrapping ByteRef into a StreamInput, a stream input is made once
- a new getExtent() method is introduced for use by aggregations like geo_bounds
- re-use bounding-box containment checks
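As a rough illustration of the Extent idea above, here is a minimal sketch of a bounding-box accumulator. The class layout, the method names `addPoint`, `addExtent`, `intersects` and `isWithin`, and the use of int-encoded coordinates are all assumptions made for this example; they are not taken from the PR.

```java
// Hypothetical sketch of an extent (bounding box) accumulator.
// Not the PR's Extent class; assumes coordinates are already encoded as ints.
final class Extent {
    int minX = Integer.MAX_VALUE;
    int minY = Integer.MAX_VALUE;
    int maxX = Integer.MIN_VALUE;
    int maxY = Integer.MIN_VALUE;

    /** Grows the extent so that it covers the point (x, y). */
    void addPoint(int x, int y) {
        minX = Math.min(minX, x);
        minY = Math.min(minY, y);
        maxX = Math.max(maxX, x);
        maxY = Math.max(maxY, y);
    }

    /** Grows the extent so that it covers another extent, e.g. that of a child node. */
    void addExtent(Extent other) {
        minX = Math.min(minX, other.minX);
        minY = Math.min(minY, other.minY);
        maxX = Math.max(maxX, other.maxX);
        maxY = Math.max(maxY, other.maxY);
    }

    /** True when this extent and the query box share at least one point. */
    boolean intersects(int qMinX, int qMinY, int qMaxX, int qMaxY) {
        return minX <= qMaxX && maxX >= qMinX && minY <= qMaxY && maxY >= qMinY;
    }

    /** True when this extent lies entirely inside the query box. */
    boolean isWithin(int qMinX, int qMinY, int qMaxX, int qMaxY) {
        return minX >= qMinX && maxX <= qMaxX && minY >= qMinY && maxY <= qMaxY;
    }
}
```

Keeping the four bounds in one object means a bounds-style aggregation can merge per-document extents with `addExtent` instead of re-deriving them from the raw coordinates.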
* Add GeometryTree support for point/multipoint

  This commit adds support for MultiPoint and Point shapes to be stored in GeometryTree. To represent the collection of points, a KDbush is used: a flat array sorted recursively, alternating between the x and y dimensions. This work is inspired by https://github.com/mourner/kdbush

  The purpose of this reader is to check whether any of the points in the kd-tree fall within the query bounding box.

* unify reader interface and cleanup multipoint usage

* respond to review
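To make the kd-sorted array idea concrete, here is a simplified sketch, not the PR's reader and not a port of kdbush. It fully sorts each sub-range with `Arrays.sort` instead of using a selection step, and it has no leaf buckets. The class name `KdSortedPoints` and the method `anyWithin` are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of an implicit kd-tree stored in a flat, recursively sorted array.
final class KdSortedPoints {
    private final double[][] pts; // each entry is {x, y}

    KdSortedPoints(double[][] points) {
        pts = new double[points.length][];
        for (int i = 0; i < points.length; i++) {
            pts[i] = points[i].clone();
        }
        build(0, pts.length, 0);
    }

    // Sorts [from, to) on the current axis so the median sits at mid,
    // then recurses into each half on the other axis.
    private void build(int from, int to, int axis) {
        if (to - from <= 1) {
            return;
        }
        final int a = axis;
        Arrays.sort(pts, from, to, Comparator.comparingDouble((double[] p) -> p[a]));
        int mid = (from + to) >>> 1;
        build(from, mid, 1 - axis);
        build(mid + 1, to, 1 - axis);
    }

    /** True when at least one stored point lies inside the (inclusive) query box. */
    boolean anyWithin(double minX, double minY, double maxX, double maxY) {
        return anyWithin(0, pts.length, 0, new double[] { minX, minY }, new double[] { maxX, maxY });
    }

    private boolean anyWithin(int from, int to, int axis, double[] min, double[] max) {
        if (from >= to) {
            return false;
        }
        int mid = (from + to) >>> 1;
        double[] p = pts[mid];
        if (p[0] >= min[0] && p[0] <= max[0] && p[1] >= min[1] && p[1] <= max[1]) {
            return true;
        }
        // The left half only holds values <= p[axis] on this axis, so it can be
        // skipped when the query box starts above the median.
        if (min[axis] <= p[axis] && anyWithin(from, mid, 1 - axis, min, max)) {
            return true;
        }
        // The right half only holds values >= p[axis] on this axis.
        return max[axis] >= p[axis] && anyWithin(mid + 1, to, 1 - axis, min, max);
    }
}
```

Because the tree is implicit in the array order, no child pointers need to be stored, which is what makes this layout attractive for a serialized format.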
The main change here is that edge-trees originally checked whether the queried extent could be contained within its shape. Since line-strings have no inner boundaries, this check is not useful; the line-crosses check plus the extent bounds check is sufficient.
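The two checks can be sketched as follows for a line-string against a query box. This is only an illustration: it walks the segments linearly rather than through an edge tree, and it uses Liang-Barsky clipping for the crossing test, which is a substituted technique and not necessarily what the PR does. The class and method names are hypothetical, and the line is assumed to have at least two vertices.

```java
// Hypothetical sketch: extent check followed by a per-segment crossing check.
final class LineBoxCheck {

    /** True when the line-string (xs[i], ys[i]) touches or crosses the query box. */
    static boolean lineIntersectsBox(double[] xs, double[] ys,
                                     double minX, double minY, double maxX, double maxY) {
        // Extent check: reject early when the line's bounding box misses the query box.
        double lMinX = Double.POSITIVE_INFINITY, lMinY = Double.POSITIVE_INFINITY;
        double lMaxX = Double.NEGATIVE_INFINITY, lMaxY = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < xs.length; i++) {
            lMinX = Math.min(lMinX, xs[i]);
            lMaxX = Math.max(lMaxX, xs[i]);
            lMinY = Math.min(lMinY, ys[i]);
            lMaxY = Math.max(lMaxY, ys[i]);
        }
        if (lMaxX < minX || lMinX > maxX || lMaxY < minY || lMinY > maxY) {
            return false;
        }
        // Crossing check: some segment must have a part inside the box.
        for (int i = 0; i + 1 < xs.length; i++) {
            if (segmentIntersectsBox(xs[i], ys[i], xs[i + 1], ys[i + 1], minX, minY, maxX, maxY)) {
                return true;
            }
        }
        return false;
    }

    // Liang-Barsky clipping: narrows the parameter range [t0, t1] of the segment
    // against each of the four box sides; an empty range means no intersection.
    static boolean segmentIntersectsBox(double x1, double y1, double x2, double y2,
                                        double minX, double minY, double maxX, double maxY) {
        double dx = x2 - x1, dy = y2 - y1;
        double[] p = { -dx, dx, -dy, dy };
        double[] q = { x1 - minX, maxX - x1, y1 - minY, maxY - y1 };
        double t0 = 0, t1 = 1;
        for (int i = 0; i < 4; i++) {
            if (p[i] == 0) {
                if (q[i] < 0) {
                    return false; // parallel to this side and outside of it
                }
            } else {
                double r = q[i] / p[i];
                if (p[i] < 0) {
                    if (r > t1) return false;
                    if (r > t0) t0 = r;
                } else {
                    if (r < t0) return false;
                    if (r < t1) t1 = r;
                }
            }
        }
        return true;
    }
}
```

There is no "query box fully inside the shape" branch here, which mirrors the point of the commit: a line has no interior that could contain the box.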
To keep aggregation logic as simple as possible, the MultiGeoPointValues object, which returns GeoPoint values for fields from doc-values, is updated to return implementations of a geo-value object that can represent either points or shapes.
After extensive evaluation and benchmarking, the TriangleTree is preferred over the GeometryTree for the following reasons:
- it is simpler to model all shapes as one tree instead of specializing and optimizing for edge cases (a GeometryCollection of Points is stored the same as a MultiPoint)
- although there are more situations where the EdgeTree outperforms the TriangleTree, when the TriangleTree is faster it is faster by a lot, because those are the cases where the EdgeTree must traverse O(n) of the edges to determine that a queried tile is outside of the shape

https://gist.github.com/talevy/f06ef43be1e97afb1ee53f25b980a4a0

Downsides of the TriangleTree when compared to the GeometryTree:
- it is not possible to reverse-engineer it into the original geometry for use in scripting
- Points cannot be stored as compactly as in the GeometryTree
- it is not faster on every single relate query
This commit serializes the ShapeType of the indexed geometry. The ShapeType can be useful for other future features. For one thing, #49887 depends on the ability to determine the highest-dimensional shape for centroid calculations. A GeometryCollection is reduced to the sub-shape of the highest dimension. Relates #37206.
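The reduction of a collection to its highest-dimensional sub-shape could look roughly like this. The enum name `ShapeDimension`, its constants and the `highest` helper are hypothetical and only illustrate the ordering (points below lines below polygons); they are not the ShapeType serialized by the commit.

```java
import java.util.List;

// Hypothetical sketch: constants are declared in order of increasing dimension,
// so compareTo() orders them as point (0) < line (1) < polygon (2).
enum ShapeDimension {
    POINT, LINE, POLYGON;

    /** Returns the highest dimension among the parts of a collection. */
    static ShapeDimension highest(List<ShapeDimension> parts) {
        if (parts.isEmpty()) {
            throw new IllegalArgumentException("collection has no parts");
        }
        ShapeDimension best = parts.get(0);
        for (ShapeDimension part : parts) {
            if (part.compareTo(best) > 0) {
                best = part;
            }
        }
        return best;
    }
}
```

For example, a collection made of a point, a polygon and a line would be treated as a polygon for the purpose of the centroid calculation.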
This PR implements proper centroid calculations of geometries according to the definition given in #49887. To compute things correctly, an additional variable-encoded long is added, representing the total weight for the centroid of the geometry in a tree. This weight is always positive. Some tests are fixed, as they did not have valid geometries. Closes #49887.
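A minimal sketch of why a total weight needs to travel with each centroid: to merge centroids from several geometries, each one has to be scaled by its weight before averaging. The class below is hypothetical and uses doubles throughout. What the weight measures is defined in #49887 and is not restated here, and the encoding of the weight as a long is not modelled.

```java
// Hypothetical sketch of merging per-geometry centroids by weight.
final class WeightedCentroid {
    private double sumX;
    private double sumY;
    private double sumWeight;

    /** Adds one geometry's centroid (x, y) together with its positive weight. */
    void add(double x, double y, double weight) {
        if (weight <= 0) {
            throw new IllegalArgumentException("weight must be positive");
        }
        sumX += x * weight;
        sumY += y * weight;
        sumWeight += weight;
    }

    /** X coordinate of the combined centroid; only meaningful after at least one add(). */
    double x() {
        return sumX / sumWeight;
    }

    /** Y coordinate of the combined centroid; only meaningful after at least one add(). */
    double y() {
        return sumY / sumWeight;
    }

    /** Total weight accumulated so far. */
    double weight() {
        return sumWeight;
    }
}
```

Adding a centroid at (0, 0) with weight 1 and another at (4, 8) with weight 3 gives a combined centroid of (3, 6) with total weight 4, whereas an unweighted average would give (2, 4).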