
Geo-shape indexing is very slow #22087

Closed
mohmad-null opened this Issue Dec 9, 2016 · 10 comments

mohmad-null commented Dec 9, 2016

I'm seeing very slow indexing with ElasticSearch when I set up a geo-shape type.
I'm using a stock 5.0.0 on Windows 7, no customisations, and I'm accessing it via the Python (3.5) wrapper. One shard.

I have 75 documents to index and if I don't set up a mapping, they index near instantly.

But the second I include a geo-shape mapping, the processing time rockets, as does CPU usage and inevitably it times out on the index creation, even with the timeout increased to 30s.

My mapping for the geo-shape looks like this:
"bbox": {
"type": "geo_shape",
"precision": "100m"
},
The rest of the mapping is just type:text. _all is disabled.

Each document has up to two geo-shapes. The shapes are simple envelope types as they're only bounding boxes - spatially this should be the simplest indexing there is for a polygon.
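For illustration, a minimal sketch of the kind of mapping and document described above, as they might be built for the Python client (the index name, doc type, and "title" field are hypothetical; the body dicts would be passed to calls such as es.indices.create and es.index):

```python
# Mapping with a geo_shape field at 100m precision, the rest plain text,
# as described in the report above.
mapping = {
    "mappings": {
        "doc": {
            "properties": {
                "bbox": {"type": "geo_shape", "precision": "100m"},
                "title": {"type": "text"}
            }
        }
    }
}

# An envelope is given as [[min_lon, max_lat], [max_lon, min_lat]],
# i.e. upper-left then lower-right corner, in GeoJSON lon/lat order.
document = {
    "title": "example",
    "bbox": {
        "type": "envelope",
        "coordinates": [[-10.0, 53.0], [-9.0, 52.0]]
    }
}
```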

The problem is exaggerated by the precision, but even with a precision of 1000m, when I send in larger quantities of documents, it still flakes out after only a few hundred have been indexed.

Changing the tree type to quadtree is several times faster than the geohash, but it still times out after using considerably more resources.
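The quadtree variant mentioned above would be a one-line change to the field mapping; a sketch of the fragment (the precision value mirrors the experiment described):

```python
# geo_shape field switched from the default geohash prefix tree to
# quadtree, keeping the coarser 1000m precision from the test above.
bbox_field = {
    "type": "geo_shape",
    "tree": "quadtree",
    "precision": "1000m"
}
```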

I've seen the "performance considerations" section in the docs (https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-shape.html), but this is far slower than I would expect any spatial indexing to be.
In a conventional spatial RDBMS (PostGIS/Oracle) I'm used to being able to insert hundreds if not thousands of complex spatial features every second.
Elasticsearch is failing to handle even a few dozen basic bounding-box geometries a second.

I appreciate I haven't optimised my deployment (still in dev), but an unoptimised RDBMS is orders of magnitude faster at the same operation. I think there's considerable scope for optimisation here, especially compared to ES's non-spatial indexing.

clintongormley commented Dec 12, 2016

@nknize could you provide some wisdom please?

uamadman commented Dec 26, 2016

I'm experiencing something similar. I've written a brutish Python script to showcase this issue.
The following simple Python script creates a geo_shape mapping and adds data.
It indexes roughly 5 linestrings per second (about 2 minutes per 1,000), while triggering GC allocation failures for the entire duration of the load.

On my current single-node system I would expect a minimum of 1,000 geo_shapes per second.

from random import random, randint
import math
import time

try:
    from elasticsearch import Elasticsearch

    es = Elasticsearch()
except ImportError:
    quit()

es_index = 'assets'


def get(query, index=es_index, es=es):
    body = {
            "query": query,
            "size": 1000
            }
    ret = es.search(index=index, body=body)
    return ret


def put(uuid, json, doc, index=es_index, es=es):
    ret = es.create(index=index,
                    doc_type=doc,
                    id=uuid,
                    body=json)
    return ret


def delete_index(index=es_index, es=es):
    ret = es.indices.delete(index=index)
    return ret


def create_index(index=es_index, es=es, number_of_shards=1, number_of_replicas=0):
    settings = {
        "settings": {
            "number_of_shards": number_of_shards,
            "number_of_replicas": number_of_replicas
        },
        "mappings": {
            "player": {
                "properties": {
                    "PATH": {
                        "type": "geo_shape"
                        # "precision": "10m"
                    }
                }

            }
        }
    }
    ret = es.indices.create(index=index, body=settings)
    return ret


def random_elastic_linestring(nodes, mx, my, Mx, My):
    linestring = []
    for n in range(0, nodes):
        # randint is imported directly above; random.randint would fail
        # because "random" here is the random() function, not the module.
        linestring.append([randint(mx, Mx), randint(my, My)])
    es_doc = {
        "type": "linestring",
        "coordinates": linestring
    }
    return es_doc



if __name__ == "__main__":      
    try:
        delete_index()
    except Exception:
        print("It doesn't exist")
    create_index()
    for i in range(1, 1000):
        x = (random() - .5) * 1000
        y = (random() - .5) * 1000
        a = {'id': 'randomname' + str(i),
             'asset': 'randomname',
             'Zone': 'X' + str(math.floor(x)) + 'Y' + str(math.floor(y)),
             'Travel_Rate': randint(400, 2000),
             'Draw_Level': 1,
             'HOME': None,
             'PATH': random_elastic_linestring(randint(2, 10), -5, -5, 5, 5),
             'DEL': False,
             '_time': time.time()}
        put(a['id'], a, 'player')


Elastic 5.1.1
JAVA_HOME: JDK 1.8.0_112
JVM settings:
-Xms4g
-Xmx4g
-XX:+PrintGCDetails

Data location: 25 GB/s RAM disk

joelstewart commented Mar 19, 2017

It is apparent that for linestrings limited to a very small geographic area, indexing speeds are acceptable even with a large number of points per linestring.

Building some random line strings with points limited to range:
private double minLat = 10.00d;
private double maxLat = 10.001d;
private double minLon = 10.00d;
private double maxLon = 10.0001d;

scales fairly well to hundreds of points, with no apparent "exponential" growth. When points are allowed to come from anywhere on the globe, however, that breaks down quickly: growth becomes non-linear at around 10 points, even a few dozen points can take 30 seconds to index a single document, and attempting 1,000 would take over an hour.
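The comparison above can be reproduced with a small generator; a sketch, with the function name my own and the bounds taken from the ranges quoted above:

```python
import random

def random_linestring(num_points, min_lon, min_lat, max_lon, max_lat):
    """Build a GeoJSON-style linestring with points drawn uniformly
    from the given bounding box."""
    coords = [
        [random.uniform(min_lon, max_lon), random.uniform(min_lat, max_lat)]
        for _ in range(num_points)
    ]
    return {"type": "linestring", "coordinates": coords}

# Tightly bounded, as in the ranges above: reported to index acceptably.
small = random_linestring(100, 10.0, 10.0, 10.0001, 10.001)
# Globe-spanning: the case reported to degrade non-linearly.
wide = random_linestring(100, -180.0, -90.0, 180.0, 90.0)
```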

(tested with 2.2.0 / 2.4.4)

@nknize - would you please follow up with what should be expected from linestring / multi-linestring indexing performance with regard to point count, point distance, and overall "complexity" of the line shape.

rsimon commented Mar 27, 2017

I can also confirm that geo_shape indexing is very slow. I have just migrated code from ES 1.7.2 to 2.4.4 and, interestingly, performance seemed to be at least one order of magnitude faster in the old version. I was using the geohash tree type (precision 50m). Changing to quadtree in v2.4.4 helped speed things up a bit, but it's still significantly slower than in v1.7.2.

As soon as I remove the shape field from my mapping (with dynamic set to false) insert speed is back to the normal high levels, so it's definitely the geo_shape. Any thoughts yet on possible optimization strategies?

Another observation - not sure if this is of any help to track this down: the majority of my "shapes" are in fact Points. (I still need to index them as geo_shapes, since my data contains a mix of points and polygons.) So in my case, we're definitely not talking about complex shapes at all.
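For clarity, a point expressed as a geo_shape value looks like the sketch below (as opposed to the simpler geo_point field type, which cannot be mixed with polygons in one field); the field name is hypothetical:

```python
# A GeoJSON point geometry indexed into a geo_shape field, the simplest
# possible shape, which the comment above reports is still slow.
doc = {
    "location": {
        "type": "point",
        "coordinates": [16.37, 48.21]  # lon, lat order
    }
}
```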

instagibb commented Jun 23, 2017

Has anyone found a solution to this slow indexing of geo_shape?

I am using ES 5.4.2 docker container with the Java client with the bulk processor and indexing mostly text/number documents with some simple polygons. It is taking around 15+ minutes to index around 5000 docs.

Before adding the geo_shape mapping I was indexing 500,000 docs in ~2-3 minutes.

The slowdown appears to be caused by submitting obsolete/bad GeoJSON. The JTS library I am using was outputting GeoJSON with an incompatible CRS (Coordinate Reference System), namely EPSG:3857 (Pseudo-Mercator / WGS 84), and was including the block:

"crs": { 
    "type": "name", 
    "properties": { 
        "name": "EPSG:3857" 
    }
}

By making sure the GeoJSON was output using the CRS EPSG:4326 (WGS 84), as per the GeoJSON spec (see the "note" in section 4), indexing speed returned to normal.
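The fix above amounts to reprojecting coordinates out of Pseudo-Mercator and dropping the non-standard "crs" member. A sketch using the standard spherical-Mercator inverse formulas (function names are my own; this handles only flat coordinate lists such as linestrings, not nested polygon rings):

```python
import math

R = 6378137.0  # Pseudo-Mercator sphere radius in metres

def mercator_to_wgs84(x, y):
    """Invert EPSG:3857 (Pseudo-Mercator) metres to EPSG:4326 lon/lat degrees."""
    lon = math.degrees(x / R)
    lat = math.degrees(2.0 * math.atan(math.exp(y / R)) - math.pi / 2.0)
    return [lon, lat]

def fix_geojson(geom):
    """Rebuild a linestring-style geometry in WGS 84 and omit any
    'crs' member, per the GeoJSON spec note referenced above."""
    return {
        "type": geom["type"],
        "coordinates": [mercator_to_wgs84(x, y) for x, y in geom["coordinates"]],
    }
```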

Hopefully this helps somebody and I haven't just wasted everyone's time :)

EDIT: Indexing geo_shape is still much slower than indexing normal text documents, but not as slow as submitting bad GeoJSON.

gem360 commented Aug 4, 2017

I can also confirm that indexing geo shapes is very slow. When we try to index a geo_shape linestring with 9k points, it takes around a minute to respond.

Using elasticsearch 2.4.

schema and example linestring attached

schema.txt
segment-9513.txt

schlosna commented Aug 24, 2017

This is likely related to an issue in spatial4j/JTS when it is attempting to determine if a line covers a rectangle, see possible PR to address: locationtech/spatial4j#144

nknize commented Oct 6, 2017

tl;dr: try setting tree to quadtree and distance_error_pct to 0.001

In short, @schlosna is correct. There are a few things going on here that cause slow shape indexing. One is related to the number of terms (quad cells) and number of vertices: the more terms, the more calls to jts.relate (which is slow), and the more vertices, the slower jts.relate becomes. Another is related to heap consumption in createCellIteratorToIndex, where intersecting cells are collected into an in-memory list. See my related comments in #25833 describing what's going on in Lucene in a little more detail and what you can do (for now) to help with these issues.

There are a few patches coming to correct this issue in the near term (fix memory consumption and change tree defaults). In the long term we are working on a new geo field type based on Bkd tree that avoids the rasterization approach altogether (which has the added bonus of eliminating the jts and s4j dependencies).
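As a concrete illustration of the tl;dr above, the suggested settings applied to a geo_shape field (the field name "PATH" and the 10m precision are taken from the script earlier in the thread; this is a sketch, not an official recommendation beyond the two parameters named above):

```python
# geo_shape mapping fragment applying the suggested workaround:
# quadtree prefix tree plus an explicit distance_error_pct.
path_field = {
    "type": "geo_shape",
    "tree": "quadtree",
    "precision": "10m",
    "distance_error_pct": 0.001
}
```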

synhershko commented Dec 20, 2017

In the long term we are working on a new geo field type based on Bkd tree that avoids the rasterization approach altogether (which has the added bonus of eliminating the jts and s4j dependencies).

+1. Would love to have an ETA on this, we are currently experiencing the same in a large scale project.

nknize commented Mar 26, 2018

Closing in favor of #25833 and #16749.

@nknize nknize closed this Mar 26, 2018
