Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot index some geo_shape geometries (before and after Geo Refactoring changes) #3909

Closed
tommcintyre opened this issue Oct 15, 2013 · 10 comments
Assignees
Labels
:Analytics/Geo Indexing, search aggregations of geo points and shapes high hanging fruit

Comments

@tommcintyre
Copy link

Probably FAO @chilling .

I have a mapping that includes some geo_shape fields. My test data contains GeoJSON fields that specify points, but do so in the form of polygons with the same lon/lat repeated 4 (or sometimes 3) times. This causes a validation exception to be thrown up from within Spatial4J when it generates the polygon.

I am not sure if this is technically invalid GeoJSON or not - however, this is a form that the Twitter API generates frequently (around 1 in 1500 tweets in my dataset), and other libraries I have used can parse it OK. It would be good if ES can attempt to do what it can with data like this, rather than failing (i.e. treat it as another appropriate type like a point, or relax the verification).

This is tested against the branch that includes the Geo-Refactoring improvements at chilling/elasticsearch@0369983 (it does also happen on master, if you can get that far!).

The stack trace:

org.elasticsearch.index.mapper.MapperParsingException: failed to parse [place]
    at org.elasticsearch.index.mapper.geo.GeoShapeFieldMapper.parse(GeoShapeFieldMapper.java:232)
    at org.elasticsearch.index.mapper.object.ObjectMapper.serializeObject(ObjectMapper.java:515)
    at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:457)
    at org.elasticsearch.index.mapper.object.ObjectMapper.serializeObject(ObjectMapper.java:515)
    at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:457)
    at org.elasticsearch.index.mapper.object.ObjectMapper.serializeObject(ObjectMapper.java:515)
    at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:457)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:508)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:452)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:341)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:203)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:533)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:418)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    ... 1 more
Caused by: com.spatial4j.core.exception.InvalidShapeException: Too few distinct points in geometry component at or near point (-81.872495, 36.163117, NaN)
    at com.spatial4j.core.shape.jts.JtsGeometry.(JtsGeometry.java:90)
    at org.elasticsearch.common.geo.builders.BasePolygonBuilder.build(BasePolygonBuilder.java:153)
    at org.elasticsearch.index.mapper.geo.GeoShapeFieldMapper.parse(GeoShapeFieldMapper.java:219)
    ... 15 more

To reproduce:

curl -XPUT http://server:9200/dummyindex -d '
{
    "mappings": {
        "dummytype": {
            "properties": {
                "place": {
                    "type": "geo_shape",
                    "tree": "quadtree",
                    "precision": "10m"
                }
            }
        }
    }
}'

curl -XPOST http://server:9200/dummyindex/dummytype/1 -d '
{
    "place": {
        "coordinates": [[[-81.872495, 36.163117], [-81.872495, 36.163117], [-81.872495, 36.163117], [-81.872495, 36.163117]]],
        "type": "Polygon"
    }
}'

I wrote a nasty, hacky workaround to fix this particular case - see below. This is obviously not an acceptable general purpose solution, as it doesn't address the issue for any other shapes, or other cases like only having 2 or 3 distinct points. I am not really familiar enough with the libraries, but I guess a real fix might involve simplifying or normalizing each geometry somehow before it gets passed to Spatial4J?

Alternatively it could be deemed that this is not ES's job, and the GeoJSON needs to be more well-formed - but I think this is likely to be a common problem due to the source of this data, and because of the different expectations of different GeoJSON libraries.

ShapeBuilder.java:

-        protected static PolygonBuilder parsePolygon(CoordinateNode coordinates) {
+        protected static ShapeBuilder parsePolygon(CoordinateNode coordinates) {
             LineStringBuilder shell = parseLineString(coordinates.children.get(0));
+            
+            if (new HashSet<Coordinate>(shell.points).size() == 1)
+                return newPoint(shell.points.get(0));
+            
             PolygonBuilder polygon = new PolygonBuilder(shell.points);
             for (int i = 1; i < coordinates.children.size(); i++) {
                 polygon.hole(parseLineString(coordinates.children.get(i)));

...

         protected static MultiPolygonBuilder parseMultiPolygon(CoordinateNode coordinates) {
             MultiPolygonBuilder polygons = newMultiPolygon();
             for (CoordinateNode node : coordinates.children) {
-                polygons.polygon(parsePolygon(node));
+              polygons.polygon((PolygonBuilder)parsePolygon(node));
             }
             return polygons;
         }
@tommcintyre
Copy link
Author

Some chat from IRC with opinions about whether this belongs in ES or not... :)

<Tim> GeoJSON spec is a little vague. A polygon must have a LinearRing, defined as:
<Tim> A LinearRing is closed LineString with 4 or more positions. The first and last positions are equivalent (they represent equivalent points). Though a LinearRing is not explicitly represented as a GeoJSON geometry type, it is referred to in the Polygon geometry type definition.
<Tim> not really clear that they must be distinct or non-self-intersecting
<Tom> yeah - i think it's just not well defined
<Tom> the new geo-refactored ES code does seem to attempt to make the best possible use of "borderline" valid geometries
<Tom> for example, twitter emits open linestrings quite often, as well these polygons with repeating points
<Tom> and the new geo-refactored ES code does now accept open linestrings
<Tom> (the old code didn't)

@ghost ghost assigned chilling Oct 16, 2013
@chilling
Copy link
Contributor

Hi @mvjars,

I think the GeoJson is technically valid. The problem is forced by the underlying libraries that we use to handle geo data inside ES. These libraries make some assumptions to the data which are checked before the data gets indexed. In my opinion this assumptions must not necessarily be checked if the GeoJson is indexed. But currently we use this libraries to index geo data also. I'm working on separating the GeoJson format specification from the logic layer and move this kind of exceptions to the logic layer.

@lababidi
Copy link
Contributor

Is there any update on this issue?

@clintongormley
Copy link

@nknize any ideas here?

@nknize
Copy link
Contributor

nknize commented Apr 26, 2015

@lababidi Out of curiosity why make this a polygon type? Why not change it to a multipoint?

@lababidi
Copy link
Contributor

@nknize Thanks for checking in. I'm not sure I understand what your question is. Let me see if I can reframe the issue, succinctly.

Twitter provides the following object. It is rejected by elasticsearch for only having 4 elements and not 5. This strict requirement is not consistent with GeoJSON. Please allow elasticsearch to accept this Object, out of the box, because technically it is legal GeoJSON. Currently I must write a middle-man to add a 5th-element (cue Bruce Willis jokes here) to this array of coordinates. This is a bit kludgy in terms of workflow:

"bounding_box": {
        "type": "Polygon",
        "coordinates": [
        [
        [
        -9.0915413,
        38.6713816
        ],
        [
        -9.0915413,
        38.9313732
        ],
        [
        -8.8102479,
        38.9313732
        ],
        [
        -8.8102479,
        38.6713816
        ]
        ]
        ]
        },

@nknize
Copy link
Contributor

nknize commented Apr 28, 2015

@lababidi Thanks for the update. I mistook the problem you're describing for the original issue (repeated points in a polygon). I agree we should accept Polygons like this and close them for you.

@lababidi
Copy link
Contributor

This is excellent news. I'd be happy to contribute any code if your plate is full. Just point me to where the preferred location that this logic belongs in and I will pull request.

@lababidi
Copy link
Contributor

@nknize If you prefer to close this issue, you may. I realized my statement was a separate issue from this issue and create it at #11131

@nknize
Copy link
Contributor

nknize commented May 14, 2015

@lababidi Excellent! If you'd like to get involved we'd love to have you contribute. You'll need to sign the CLA if you haven't already. I went ahead and opened a PR to address #11131, have a look and try it out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Geo Indexing, search aggregations of geo points and shapes high hanging fruit
Projects
None yet
Development

No branches or pull requests

6 participants