Import to pig fails when JSON contains an array #36
Can you upload a small snippet of your data set so I can replicate the problem on my side? Just upload it somewhere, since I'm not sure GitHub can handle attachments and pasting it here might cause whitespace problems.
I added a short (1 result) JSON document that shows the result of an Elasticsearch query on my system. The result set contains an array on lines 19, 22, and 23, any of which will cause the problem. https://github.com/bwmeier/elasticsearch-hadoop/blob/master/elasticsearch_array_result.json If you have any questions about the result, let me know.
The error has changed slightly, but not significantly. I now get the following exception trace, which again looks like the array processing, just with the ArrayList type instead of Byte[]. I suspect that Pig needs a translation from ArrayList to DataBag, or something of the sort, since the message refers to a "standard Pig type".
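The translation being asked for here can be illustrated with a small sketch (Python for brevity; the real converter lives in the connector's Java code, and `to_pig_value` is a hypothetical name, not an actual API): each array coming out of the parsed JSON has to be recursively rewritten into a Pig-friendly shape - a bag of single-field tuples - before Pig's serializer ever sees it, since Pig has no plain list type.

```python
def to_pig_value(value):
    """Hypothetical sketch: map parsed-JSON values onto Pig-style types.
    A JSON array becomes a bag (modeled as a list of 1-field tuples),
    an object becomes a map, and scalars pass through unchanged."""
    if isinstance(value, list):
        # Pig has no List type: wrap each element in a tuple to form a bag
        return [(to_pig_value(v),) for v in value]
    if isinstance(value, dict):
        return {k: to_pig_value(v) for k, v in value.items()}
    return value

doc = {"name": "artist", "tags": ["rock", "jazz"]}
pig_doc = to_pig_value(doc)  # tags becomes a bag of 1-field tuples
```

Without a step like this, the raw ArrayList reaches Pig's serializer unchanged, which is exactly what the "standard Pig type" message is complaining about.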
Understood - I've reopened the issue and will try to address it by next week.
Thanks @costin, I appreciate the work. I'll test this out when I get the chance, but it won't be for a few days :-)
Hi Boyd, I think I've found (and fixed) the issue. Pig complex types were not handled properly when reading them back from Pig. Note that the conversion (not just for Pig but for Hive, M/R, etc.) will be overhauled; however, it should be working just fine now. The only remaining issue is that bags are converted to tuples when they are deserialized (that's because we don't yet hold any extra information that would allow us to differentiate between tuples and bags). It would be great if you could try out the latest master - see the README on how to get the latest nightly build (I've just pushed one right now: http://build.elasticsearch.org/browse/ESHADOOP-NIGHTLY-40).
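The bag-vs-tuple ambiguity described above is easy to reproduce in miniature (a Python sketch, not the connector's actual wire format): once a writer records only the element values and not the container kind, a bag and a tuple holding the same values become indistinguishable, so the reader has to pick one shape when deserializing.

```python
# Model a Pig bag as a list of 1-field tuples, and a Pig tuple as a
# plain Python tuple. Both hold the values 1 and 2.
bag = [(1,), (2,)]
tup = (1, 2)

def naive_serialize(container):
    # Hypothetical writer that keeps the element values but drops the
    # container type - the missing "extra information" mentioned above.
    flat = []
    for item in container:
        if isinstance(item, tuple):
            flat.extend(item)
        else:
            flat.append(item)
    return flat

# Both containers serialize identically, so a reader cannot tell
# which one to reconstruct and must default to one of them.
```

This is why the deserializer currently defaults everything to tuples: recovering the original distinction would require carrying type metadata alongside the values.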
Hi Costin, I've been on vacation since the 3rd, so I'm just now getting the chance to look at this.
Costin, the fix works. However, the performance is pretty poor - I was able to process 37500 records and write them to JsonStorage, but it took about 45 minutes. I'm not sure whether that's related to the fix or not. I tested the version of the fix pointed to by that commit; I have not tested anything later than that.
Note that there's a serialization improvement for writing to ES coming, hopefully in the next few days.
When the JSON in the result set contains an array, Pig 0.10 fails during internal serialization with an exception trace similar to the following. It appears that the array is being serialized as a Byte[] at some level, and Pig cannot handle that.
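The shape of this failure can be sketched in a few lines (Python sketch with a hypothetical `naive_convert`; the real trace comes from Pig's Java serializer): if the converter passes scalars through but falls back to handing over raw JSON bytes for any type it does not recognize, an array field surfaces as an opaque byte string instead of a Pig bag, and the downstream serializer rejects it.

```python
import json

def naive_convert(value):
    # Hypothetical converter: scalars pass through, but anything
    # unrecognized (such as a JSON array) is handed over as raw
    # bytes - the Byte[]-like behavior the trace suggests.
    if isinstance(value, (str, int, float, bool)) or value is None:
        return value
    return json.dumps(value).encode("utf-8")

doc = {"name": "artist", "tags": ["rock", "jazz"]}
converted = {k: naive_convert(v) for k, v in doc.items()}
# converted["tags"] is now bytes, not a bag of tuples
```

Any document without array (or object) fields converts cleanly under this scheme, which matches the observation that only results containing arrays trigger the failure.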