Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change geo_shape default configuration #2756

Closed
jillesvangurp opened this issue Mar 10, 2013 · 7 comments
Closed

change geo_shape default configuration #2756

jillesvangurp opened this issue Mar 10, 2013 · 7 comments

Comments

@jillesvangurp
Copy link
Contributor

Currently geo_shape defaults to the geohash implementation and a rather high tree_levels setting. I've been testing with some open street map data consisting of about 120K points, polygons, and linestrings in the Brandenburg area on Simon Willnauer's lucene 4.2 branch which is about to be merged in the next few days. The pre lucene 4.2 geo_shape seems to have severe accuracy problems currently.

When I index this data with the default settings (geohash, tree_levels=24), I end up with nearly a GB of index data (the raw data is 35MB). Indexing takes nearly ten minutes using the bulk api, batches of 500, and with six threads Reducing the tree_levels to 9 reduces the index size substantially. However with this setting, the accuracy drops significantly and such indiceses are useless for searching unless you find margins of errors measured in kilometers acceptable.

I've also tested with the quadtree implementation. The default setting for this offers very poor accuracy and returns tens of thousands of results for queries that should return only a few dozen results. However, increasing the tree_levels to 20 results in pretty good accuracy (a few tens of meters margin of error) and an index size of only 72MB, which is very reasonable and it indexes in seconds. Increasing the tree_levels to 25 increases the index size to half a GB and the accuracy improves to being comparable with geohash at its default settings.

Based on this, I would like the defaults for geo_shape to be changed as follows:

  1. make quadtree the default implementation. I think that overall it offers better index size, indexing speed than geo_hash. A smaller index size also means it probably performs better on queries and uses less memory.
  2. increase default tree_levels for quadtree to 20. This seems to give a reasonable tradeoff between index size, indexing speed, and accuracy. The user can increase it for better accuracy. The current default does not provide anywhere near acceptable levels of accuracy.
@ghost ghost assigned s1monw Mar 11, 2013
@s1monw
Copy link
Contributor

s1monw commented Mar 11, 2013

Jiles, thanks for opening this issue. This provides valuable input and I agree we should totally fix our defaults here. I wonder if we should adjust the default levels for GeoHash as well here since it seems to bloat the index dramatically we might change that too to prevent pitfalls?

I wonder if we should also allow the users to configure this via enums rather than levels where you an say, "HIGH_ACCURACY" rather than quad_tree with level 24? I guess that would mean much more to a lot of users?

Regarding Lucene 4.2 I am just waiting for the maven mirrors to pick up the release so that should be in later today.

@IamJeffG
Copy link
Contributor

Related: Given that we require extremely fine resolution in the index (i.e. the lowest possible tree levels) to avoid exorbitant false positives, I have trouble imagining a scenario where anyone would want distance_error_pct to be > 0.0. (All distance_error_pct does is allow higher levels of the tree to be used -- resulting in still more errant results). If we are going to be serious about improving accuracy in geo_shape queries, it worth defaulting distance_error_pct to 0.0?

In either case, defaulting tree_levels greater than about 14 will result in unacceptable latency for anybody who tries to index shapes on the order of hundreds of kilometers, or query over similar extents. The ES mapping must generate all the tiles over your shape, which ends up being an algorithm exponential in the number of levels. For instance, if I am indexing shapes on the scale of countries or timezones, and I set quadtree tree_levels to 20, an index operation generates hundreds of thousands of cells for the Lucene index, driving up latency to > 10 seconds per document.

To effectively use the geo_shape mapping really requires an understanding by the end-user of its implementation and trial and error around ideal tree_levels settings for his/her documents. Better ES documentation would do wonders here. Frankly, I feel that merely changing the defaults is a mere optimization for a presumed use case (we have no idea what shapes the typical user is indexing), and skirts two bigger problems:

  1. no spatial index is immune from errant matches without actually verifying candidates against the query (see Issue GeoShape intersects operation returns false positives #2361)
  2. The quadtree and geohash implementations do not exploit the tree structure of the index at query time. Rather they pick a single tree level to create a naive grid over the earth.

@jillesvangurp
Copy link
Contributor Author

I partially agree, though I think the majority of users is probably interested in smaller features like pois, building polygons, etc. The current defaults with quad tree are kind of useless for that. With geohash they are fine but you do end up with a very large index

For extremely large polygons, there is indeed an issue at higher tree_levels. Maybe the behavior should vary depending on the size of the geojson feature being indexed/queried instead of being fixed for all documents. I'm not sure how feasible that is with the current implementation.

In any case, I've uploaded the osm data I have been testing with here. It's derived from open street map and I've also included a public domain file with geojson for the country borders: https://dl.dropbox.com/u/18756426/osm-geojson.zip

@jillesvangurp
Copy link
Contributor Author

@s1monw I think changing the parameters to be less technical would be a good idea. I think people think about accuracy in terms of meters. It should be fairly straightforward to translate that into the appropriate tree_levels for each implementation.

@s1monw
Copy link
Contributor

s1monw commented Mar 20, 2013

@jillesvangurp can we close this with the push or #2803

@jillesvangurp
Copy link
Contributor Author

Yes, just looked at the commit and new defaults seem fine. I'll close it.

@s1monw
Copy link
Contributor

s1monw commented Mar 20, 2013

thanks man!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants