EsHadoopIllegalStateException reading Geo-Shape into DataFrame - SparkSQL #607
The issue here is that the connector doesn't know what to translate the geo_shape field into - would a String containing the raw JSON work?
I think a String containing JSON would work for us, thanks. That would be much better than a fatal exception.
@randallwhitman Hi, I've taken a closer look at this and it's a bit more complicated - fixable, but not as easy as I thought. First off, field "crs" is null, meaning it is not mapped - there's no type information associated with it and thus no mapping. The connector doesn't even see the field when looking at the mapping, so when it encounters it in the actual document, the parsing fails.
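To illustrate the mismatch (a sketch - the exact mapping output depends on the ES version):

```scala
// What the mapping reports for a geo_shape field - no sub-fields at all:
val mappingJson = """{"rect": {"type": "geo_shape"}}"""

// What a document actually contains - "type", "coordinates" and "crs",
// none of which appear in the mapping above, so the connector has no
// type information for them:
val docJson =
  """{"rect": {"type": "Polygon",
               "coordinates": [[[50,32],[69,32],[69,50],[50,50],[50,32]]],
               "crs": null}}"""
```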
Interesting, thanks for the update. Is there any merit to an option that treats the field as a String while reading, so the result is a raw String of JSON? We'd be able to post-process the JSON with a method such as GeometryEngine.geometryFromGeoJson.
You mean reading the field in raw JSON format instead of parsing it? You could do such a thing by plugging in a customized ValueReader.
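A rough sketch of the wiring, assuming the connector's es.ser.reader.value.class setting is the hook (the reader class name below is hypothetical, and the actual ValueReader contract is defined by the connector):

```scala
// Point the connector at a custom ValueReader implementation.
// com.example.RawJsonValueReader is a hypothetical class that would
// return the field's raw JSON as a String instead of parsing it.
val connectorConfig = Map(
  "es.ser.reader.value.class" -> "com.example.RawJsonValueReader"
)
```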
I'll take a look at the link you provided, thanks.
I roughed out some code that reads:

```scala
val base: RDD[(String, Map[String, AnyRef])] = EsSpark.esRDD(...)
val rtmp: RDD[Row] = base.map(... case geo_shape => convertMapToString ...)
val schema = ... // application-specific interpretation from the mapping
val df = sqlContext.createDataFrame(rtmp, schema)
```

The workaround as I have it now converts JSON to Map and back to JSON again before parsing an object. Perhaps I could work around that with … But as this was referred to as a workaround, I understand it to be not the recommended approach, but rather a temporary one until this issue is resolved.
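For reference, a minimal sketch of the raw-JSON route via esJsonRDD, assuming the Esri geometry-api-java library for post-processing (extractGeoField is a crude helper written only for this sketch):

```scala
import org.apache.spark.SparkContext
import org.elasticsearch.spark.rdd.EsSpark
import com.esri.core.geometry.{Geometry, GeometryEngine}

// Crude field extraction, for the sketch only - a real job would use a
// proper JSON parser here.
def extractGeoField(doc: String, field: String): String = {
  val start = doc.indexOf("\"" + field + "\":") + field.length + 3
  doc.substring(start, doc.lastIndexOf("}"))
}

def readShapes(sc: SparkContext, resource: String) = {
  // esJsonRDD yields (documentId, rawJsonString) pairs - no Map round-trip.
  EsSpark.esJsonRDD(sc, resource).mapValues { doc =>
    GeometryEngine.geometryFromGeoJson(extractGeoField(doc, "rect"),
      0, Geometry.Type.Polygon)
  }
}
```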
Thanks for the update. The double JSON conversion is wasteful (not to mention the connector can, and already does, do this itself).
@randallwhitman Hi, this has been fixed in master - can you please try the latest dev build and let me know if it works for you?
It's already in there.
Today I am trying this out, with polygon geometries. I tried setting the value both ways; either way I am still seeing the exception.
What version of ES are you using?
I think I found the culprit - the automatic array detection was not plugged into the dedicated geo parsing. Cheers,
The server running is 1.6.2.
Why not use the same API version as the server?
API 1.6.2 and API 1.7.4 both give the same exception.
Add dedicated parsing and handling of Geo types and inferring of data based on 'sampling' of data. As Geo types are not properly described into their mappings (ES provides only `geo_shape` and `geo_point` but there's no information about the geo type used), ES-Hadoop now detects a geo field and will parse it in an ad-hoc manner. However for strongly-typed environments (such as Spark SQL), it will 'sample' the data, by asking for one document so the actual content will be parsed in order to determine the format and use that for the inferred data set. relates #607
@randallwhitman I've just pushed a fix for your issue in master. It is a fairly substantial change, especially on the Spark side, so please try it out. To get around the missing type information, ES-Hadoop now detects when a field is of a geo type and, in the case of Spark SQL, will sample the data (fetch one random document that contains all the geo fields), parse it, determine the format, and in turn generate the schema. tl;dr - point the latest ES-Hadoop dev snapshot at your data set and that's it; the schema should be inferred automatically. Cheers,
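Concretely, the expected usage with the dev build is just the standard read path (a sketch reusing connectorConfig and shapeResource from the test snippets in this thread):

```scala
// With the dev snapshot, no manual schema plumbing should be needed:
val df = sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .options(connectorConfig)
  .load(shapeResource)

// The geo_shape field should now show up with an inferred schema.
df.printSchema()
```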
The first time I got the test to run today, I got a different error, but I had left in the …
Looks like you are running ES pre-2.0 - will look into adding a compatibility fix for that.
Yes, the server is running version 1.6.2.
Pushed a dev build that should address the issue - can you please try it out?
It's the date that matters more than the commit SHA, which only gets updated if the change is committed. If the code is not committed (for whatever reason), the commit signature will remain the same. In cases like these I end up publishing the build before committing, hence the git SHA is the same.
[SPARK] Perform ES discovery before mapping discovery relates #607
Published another snapshot, which should have the git SHA updated. Note the geo functionality is available only in the Spark SQL integration for Spark 1.3 or higher.
The bug is caused by the shape name (which is expected to be lower case, not mixed case). Pushed a fix in master and just uploaded a new dev version. Please try it out.
I hit snags re-running tests - I will look again tomorrow.
I am consistently seeing a …
What version of Spark are you using?
I thought I was using Spark-1.4, but I will double-check by re-running with an explicit …
Should be fixed in master; also pushed a new dev build - can you please try it out? Thanks,
With Spark-1.4: …
Can you please post your mapping and a sample data set, along with a gist of the logs? Also do note that the data is expected to have the same format (since that's what Spark SQL expects). If your geo_shapes are of different types, I'm afraid there's not much we can do - not if you want to use DataFrames.
In the geo-shape test, the test data is a single polygon:

```scala
val rawShape = List(
  """{"rect":{"type":"Polygon","coordinates":[[[50,32],[69,32],[69,50],[50,50],[50,32]]],"crs":null}}""")
val rdd1 = sc.parallelize(rawShape, 1)
rdd1.saveJsonToEs(shapeResource, connectorConfig)
```
I realize that, but that info is not very helpful - the log and the mapping, however, are.
I won't be able to get to that right away.
ES allows extra fields to be specified for geo types. relates #607
Found out what the issue was - geo types for some reason accept custom fields (like crs). I've pushed a fix for this and published a dev build - can you please try it out? (The usual drill :) )
Right, GeoJSON can contain "crs" and/or "bbox". With that patch, my test now passes, thanks!
And there was much rejoicing. Let's give this some extra days to see whether it passes all your tests, and then I'll close it down. Cheers,
OK. When I println the result, I see …
There are no guarantees (we currently control the schema, but that might change). However, it should be irrelevant, as one can access the items by name.
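For example (a sketch - assuming the inferred struct exposes the members seen in the sampled document, i.e. type and coordinates):

```scala
// Access nested members by name rather than relying on column order:
val shapes = df.select("rect.type", "rect.coordinates")
shapes.show()
```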
Closing the issue.
Hi @costin, we are also facing issues pushing geo-shapes to an Elasticsearch index through Spark (Java).
@Bomb281993 Please refrain from mentioning users on old issues like this. If you are seeing errors with geo-shape indexing, please post those errors and a description of the problem on the forum or in a new issue.
Writing an RDD[String] containing a polygon as GeoJSON, as the value of a field whose name matches the mapping:

```scala
"""{"rect":{"type":"Polygon","coordinates":[[[50,32],[69,32],[69,50],[50,50],[50,32]]],"crs":null}}"""
rdd1.saveJsonToEs(indexName+"/"+indexType, connectorConfig)
```

Reading it back with either esDF or read/format/load:

```scala
sqlContext.esDF(indexName+"/"+indexType, connectorConfig)
sqlContext.read.format("org.elasticsearch.spark.sql").options(connectorConfig).load(indexName+"/"+indexType)
```

Result is:

```
org.elasticsearch.hadoop.EsHadoopIllegalStateException: Field 'rect' not found; typically this occurs with arrays which are not mapped as single value
```

Full stack trace in gist. Elasticsearch Hadoop v2.1.2