Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add aggregation support for geo_shape fields #50834

Open
wants to merge 122 commits into
base: master
from
Open

Conversation

@talevy
Copy link
Contributor

talevy commented Jan 10, 2020

This PR introduces doc-values support for geo_shape fields.
This includes aggregation support for the following existing
geo_point aggregations:

  • geohash_grid
  • geotile_grid
  • geo_centroid
  • geo_bounds

The geo_distance aggregation is the only one not supported in this PR.
Scripting support is also not implemented.

feature-branch PRs still in-flight or not yet up that are blocking this PR:

  • add doc_values mapping option to geo_shape field mapping (#47519)
  • Add eager max-bucket-count check for geo_shape grid aggregation (#50742
  • Adds support for geo-bounds filtering in geo-grid aggregations
    • PR against master for existing geo_point implementation (#50002)
    • PR against geoshape_doc_values

TODOs that are not blocking this PR:

  • Documentation cleanup and examples added
  • explicit tests verifying behavior of geo_shapes in scripts (need to verify necessity of this)

Closes #37206.

talevy and others added 30 commits May 21, 2019
This commit introduces a new data-structure
for reading and writing EdgeTrees that write/read
serialized versions of the tree.

This tree is the basis of Polygon trees that will contain representation
of any holes in the more complex polygon
The GeometryTree represent an Elastisearch Geometry
object. This includes collections like MultiPoint
and GeometryCollection.

For the initial implementation, only polygons without
holes are supported.

In a follow-up PR, the GeometryTree will be the object that
interacts with doc-value reading and writing.
- min and max values of coordinates were difficult to
  track, this fixes that by introducing a new Extent
  object
- Instead of re-wrapping ByteRef into a StreamInput, a stream
  input is made once
- a new getExtent() method is introduced for use by aggregations like
  geo_bounds
- re-use bounding-box containment checks
* Add GeometryTree support for point/multipoint

This commit adds support for MultiPoint and Point
shapes to be stored in GeometryTree.

To represent the collection of points, a KDbush is used, which is
a sorted array sorted recursively by alternating dimensions x/y.
This work is inspired by https://github.com/mourner/kdbush

The purpose of this reader is to check whether any subset of the
points in the kd-tree are contained within the bounding-box query.

* unify reader interface and cleanup multipoint usage

* respond to review
The main change here is that edge-trees originally
checked whether the queried extent could be contained
within its shape. Since line-strings have no inner boundaries,
this check is not useful, the line crosses check + extent-check-bounds
is sufficient.
To aid in keeping aggregation logic as simple as possible,
the MultiGeoPointValues object that returns GeoPoint values
for fields from doc-values is updated to return implementations
of a geo-value object that can represent either points or shapes.
talevy and others added 15 commits Dec 10, 2019
After lots of evaluation and benchmarking, it seems
the TriangleTree is preferred over the GeometryTree
for the following reasons

- simpler to model all shapes as one tree instead of
  specializing and optimizing for edge-cases. (GeometryCollection
  of Points is stored the same as a MultiPoint)
- Although there are more situations where the EdgeTree out-performs
  the TriangleTree, the times it is faster it is faster by a lot
  because it is these times that the EdgeTree must traverse O(n) of
  the edges to determine a queried tile is outside of the shape.
  https://gist.github.com/talevy/f06ef43be1e97afb1ee53f25b980a4a0

Downsides of the Triangle when compared to Geometry Tree

- it is not possible to reverse-engineer into the original geometry
  for use in scripting
- Points cannot be stored as compactly as in the GeometryTree
- not faster on every single relate query
This commit serializes the ShapeType of the indexed
geometry. The ShapeType can be useful for other future
features. For one thing: #49887 depends on the ability
to determine what the highest dimensional shape is
for centroid calculations.

GeometryCollection is reduced to the sub-shape of the
highest dimension

relates #37206.
This PR implements proper centroid calculations of geometries according to the 
definition defined in #49887. To compute things correctly, an additional variable 
encoded long representing the total weight for the centroid of the geometry in 
a tree. This weight is always positive. Some tests are fixed, as they did not 
have valid geometries.

closes #49887.
@elasticmachine

This comment has been minimized.

Copy link
Collaborator

elasticmachine commented Jan 10, 2020

Pinging @elastic/es-analytics-geo (:Analytics/Geo)

@talevy talevy added WIP >feature and removed >enhancement labels Jan 10, 2020
Copy link
Contributor

jpountz left a comment

I quickly scanned through the pull request and left some questions, but this looks pretty good overall!

GEOMETRYCOLLECTION_LINES, // highest-dimensional shapes are Lines
GEOMETRYCOLLECTION_POLYGONS; // highest-dimensional shapes are Polygons

public static Comparator<DimensionalShapeType> COMPARATOR = Comparator.comparingInt(DimensionalShapeType::centroidDimension);

This comment has been minimized.

Copy link
@jpountz

jpountz Jan 14, 2020

Contributor

Should we make it the natural order of DimensionalShapeType by reordering enum constants?

This comment has been minimized.

Copy link
@talevy

talevy Jan 14, 2020

Author Contributor

I considered it, but thought that type of implicit relationship could be dangerous and unclear to a reader. If you think that concern is overblown, then I am happy to re-use the enum's ordinal here and document the nature of the ordering

This comment has been minimized.

Copy link
@jpountz

jpountz Jan 14, 2020

Contributor

I think the thing that I found a bit trappy is that this comparator is called just COMPARATOR, which suggests it is the natural order for this enum, yet Java enums also implement Comparable and in that case it would be a different order. I suggested reordering constants, but renaming the comparator would work for me too.

This comment has been minimized.

Copy link
@talevy

talevy Jan 14, 2020

Author Contributor

makes sense. I agree it isn't done the best way. I'll address this. thanks!

posLeft = 0;
negRight = 0;
break;
default:

This comment has been minimized.

Copy link
@jpountz

jpountz Jan 14, 2020

Contributor

can you make it an explicit case (ALL_SET) and fail if type is an unknown byte to better detect corruptions of the stream?

private static final byte POSITIVE_SET = 1;
private static final byte NEGATIVE_SET = 2;
private static final byte CROSSES_LAT_AXIS = 3;
private static final byte ALL_SET = 4;

This comment has been minimized.

Copy link
@jpountz

jpountz Jan 14, 2020

Contributor

I wonder whether this is the right characterization of the different states we have. An alternative would be to record the number of longitude ranges we have, which could be 0, 1 or 2?

case NONE_SET : break;
case POSITIVE_SET:
output.writeVInt(this.posRight);
output.writeVLong((long) this.posRight - this.posLeft);

This comment has been minimized.

Copy link
@jpountz

jpountz Jan 14, 2020

Contributor

is there a reason why we are encoding posRight and posRight - posLeft instead of posLeft and posRight - posLeft? posLeft is always going to be less than posRight so the VInt takes less space?

This comment has been minimized.

Copy link
@iverase

iverase Jan 14, 2020

Contributor

+1 that is my fault

output.writeVLong((long) this.posRight - this.posLeft);
break;
case NEGATIVE_SET:
output.writeInt(this.negRight);

This comment has been minimized.

Copy link
@jpountz

jpountz Jan 14, 2020

Contributor

should it be symmetric with POSITIVE_SET and do writeVInt(-this.negRight)?

This comment has been minimized.

Copy link
@iverase

iverase Jan 14, 2020

Contributor

+1 that is my fault. Didn't thought about that

if (geoValues.advanceExact(docId)) {
ValuesSourceType vs = geoValues.valuesSourceType();
MultiGeoValues.GeoValue target = geoValues.nextValue();
// TODO(talevy): determine reasonable circuit-breaker here

This comment has been minimized.

Copy link
@jpountz

jpountz Jan 14, 2020

Contributor

do we plan to handle this before merging or can it wait?

This comment has been minimized.

Copy link
@talevy

talevy Jan 14, 2020

Author Contributor

I consider this to be very harmful to a user since geo-data is often unwieldy and could easily result in someone requesting to tile a continent-sized shape using very high precision and causing dangerous memory pressure given how we are leveraging the long[] array to temporarily hold buckets.

I've opened #50720 to address this by accounting for the long[]'s bytes during bucket generation.

This comment has been minimized.

Copy link
@talevy

talevy Jan 14, 2020

Author Contributor

oh and to answer the question: I hope to finish that PR before merging since it feels critical to me

double maxY = Geohash.decodeLatitude(max);
double maxX = Geohash.decodeLongitude(max);
for (double i = minX; i <= maxX; i += Geohash.lonWidthInDegrees(precision)) {
for (double j = minY; j <= maxY; j += Geohash.latHeightInDegrees(precision)) {

This comment has been minimized.

Copy link
@jpountz

jpountz Jan 14, 2020

Contributor

I wonder whether accuracy issues of floating-point additions could cause bugs here, is it possible to work directly on the encoded space instead?

talevy and others added 5 commits Jan 14, 2020
Co-Authored-By: Adrien Grand <jpountz@gmail.com>
Co-Authored-By: Adrien Grand <jpountz@gmail.com>
4^(x) can be re-written as 2^(2*x) to avoid usage of the slower Math#pow method

Co-Authored-By: Adrien Grand <jpountz@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.