Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default geo_shapes to quadtree #25833

Closed
gmoskovicz opened this Issue Jul 21, 2017 · 10 comments

Comments

Projects
None yet
6 participants
@gmoskovicz
Copy link
Contributor

gmoskovicz commented Jul 21, 2017

We have been discussing this for a while and it seems to be that quadtree(s) are much faster to be indexed. Also, most people don't realize that setting precision will then default distance_error_pct to zero, also causing the indexing times of geo shapes to be very slow.

My proposal here is to:

  1. Change the default tree to quadtree .
  2. Default distance_error_pct to 0.001 when precision is set.

CC @nknize

@nknize

This comment has been minimized.

Copy link
Member

nknize commented Jul 21, 2017

Thanks @gmoskovicz

So we are trying to get rid of quadtree, but yet it is still the fastest and should be the default?

I probably confused you a bit...cause English :) Quadtree is a general algorithm for rasterizing geoshapes. How those quad raster cells are encoded in lucene's terms dictionary depends on the "tree" parameter which tells Lucene to encode either as a geohash string or packed binary Morton code. Packed morton encoding (selected by setting "tree" : "quadtree" is faster (index and query) and consumes less disk space than using "geohash" so that's why quadtree encoding is recommended.

Now, index and query speed is directly related to shape size and resolution because under the hood of Lucene each quad raster cell is converted into a bounding box and compared to the vectorized shape using a call to the JtsGeometry.relate method you reference in #2157. The higher the resolution the more cells are created (like a high resolution image) so the more calls to relate; and that method is slow. This is the general problem with using a raster based quadtree. For this reason we're working toward migrating to a B-kd based geo_shape indexing approach and (eventually) moving away from quadtrees altogether.

Hopefully that makes more sense.

@karesh

This comment has been minimized.

Copy link

karesh commented Jul 24, 2017

@nknize is this the reason for my issue #25844 ?
for bigger polygon the index is slow.

@nknize

This comment has been minimized.

Copy link
Member

nknize commented Jul 25, 2017

@karesh That's right. For the issue you describe in #25844 I'd recommend taking a similar mapping approach to what's suggested above (e.g., set precision to something like 10m and default_error_pct to something like 0.025 - but play around w/ these parameters) . If the precision parameter is set, the distance_error_pct parameter is automatically set to 0; rationale being that users expect precision accuracy when they define an explicit precision. However, this can lead to performance issues as these quad cells are created (at both index and query time). The distance_error_pct will help alleviate these performance issues at the cost of false positives. At the moment there is no post-filtering of the results returned by lucene so you'd have to post filter the results in your application.

@davemarsland

This comment has been minimized.

Copy link

davemarsland commented Jul 27, 2017

Just to +1 this, after advice from @gmoskovicz we've changed to the quadtree type and have seen a dramatic improvement in all metrics, especially indexing and search latency (in the case of search latency, from ~50ms to sub 1ms). Feels like as long as it's explained well, it should be the default.

@synhershko

This comment has been minimized.

Copy link
Contributor

synhershko commented Dec 20, 2017

+1 for changing the defaults as indicated. And just noting that for a project I'm assisting, with several simple polygons indexed as precision: 10m and tree: quadtree, indexing is very slow, consumed disk space is very large, and setting distance_error_pct: 0.001 doesn't really help with either.

@nknize any ETA for a bkd-tree based geo-shape indexing?

@gmoskovicz

This comment has been minimized.

Copy link
Contributor Author

gmoskovicz commented Dec 20, 2017

@synhershko have you seen any improvement testing the defaults vs quadtree with some error percentage?

I wouldn't expect it to fix every single case (because #25833 (comment)) but i would expect it to have a significant change comparing to the defaults.

@synhershko

This comment has been minimized.

Copy link
Contributor

synhershko commented Dec 20, 2017

Using 5.6.0 and I'm assuming the default is distance_error_pct: 0 - so we compared it to that with no significant improvement (or none at all). Started with quadtree to begin with.

@synhershko

This comment has been minimized.

Copy link
Contributor

synhershko commented Dec 20, 2017

fwiw the most significant drop in performance (and increase in disk size) was when we went from precision: 50m to precision: 10m (both are quadtree and distance_error_pct: 0.001). In our tests, 50m gives us bulk of 500 in ~1s and about 50kb/doc on disk while 10m is bulk of 500 in ~5s and about 150kb/doc on disk.

@polyfractal

This comment has been minimized.

Copy link
Member

polyfractal commented Dec 10, 2018

Since BKD-backed geo_shapes are soon to be the default (#35320), I'm going to close this issue.

I think the distance_error_pct default still has merit though, and will be addressing that in a different ticket (which results in OOM).

@nknize

This comment has been minimized.

Copy link
Member

nknize commented Dec 10, 2018

++ #35320 also sets the default tree parameter to quadtree for PrefixTrees

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.