Can't insert to ES - Malformed content, must start with an object #178

Closed
whitfin opened this Issue Mar 26, 2014 · 16 comments

whitfin commented Mar 26, 2014

So I'm using the library to transfer a bunch of JSON over to ES from Hadoop. The connection comes up just fine in my ES console, but I see this error:

org.elasticsearch.index.mapper.MapperParsingException: Malformed content, must start with an object
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:489)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:462)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareCreate(InternalIndexShard.java:371)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:400)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:153)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:556)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:426)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

I'm invoking my OutputCollector with just output.collect(key, new Text(value)), where value is a JSON string.

This is what ES shows:

index {[latest][event][dNv98KOCSjuhbLXMogOI1Q], source["<myJSONstring>"]}

Is there some way I should be transforming my JSON string? I've tried conf.set("es.input.json", "yes"); but nothing seems to be working.

Thanks in advance.

costin commented Mar 26, 2014

What does your JSON string look like? The error indicates that at some point some document might be malformed.
If you specify a JSON string, es-hadoop will not do any processing whatsoever; it passes the data as-is.
Try to isolate the sample data and, if you're sure the JSON doc is valid, enable TRACE logging on the org.elasticsearch.hadoop package - you'll see all the data that is sent to ES.
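For example, with the stock log4j setup (assuming your jobs pick up the log4j.properties under your Hadoop conf directory - the exact file and location vary by distribution), a single extra line should be enough:

log4j.logger.org.elasticsearch.hadoop=TRACE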

P.S. What version of es-hadoop are you using?

whitfin commented Mar 26, 2014

@costin I'm using 1.3.0.M2. The one thing I did notice is that the printed JSON (in the ES console) has a lot of \\\ before any quotes. Could this be part of the issue?

I ran my raw JSON through a validator and it all looks good.

How would I enable the TRACE logging?

Edit: Here is the entire error, with a new JSON string:

[2014-03-26 16:58:19,769][DEBUG][action.bulk              ] [Caiera] [latest][2] failed to execute bulk item (index) index {[latest][event][aN1URSLyRhCQayQulyn3tw], source["{\"title\": \"The Godfather\",\"director\": \"Francis Ford Coppola\",\"year\": 1972,\"genres\":[\"thriller\",\"mobster\",\"dance-parody\"]}"]}
org.elasticsearch.index.mapper.MapperParsingException: Malformed content, must start with an object
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:489)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:462)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareCreate(InternalIndexShard.java:371)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:400)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:153)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:556)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:426)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

And the string:

{"title": "The Godfather","director": "Francis Ford Coppola","year": 1972,"genres":["thriller","mobster","dance-parody"]}

I also noticed this error was thrown 4 times even though the JSON only appears once (although I think this could just be retries).

costin commented Mar 27, 2014

@IWhitfield Most likely - the printed string should have no escaping (which you'd get, for example, when constructing the string programmatically) - make sure that doesn't happen (maybe the string gets escaped one too many times?).

whitfin commented Mar 27, 2014

@costin The string I'm testing with is just

{"title": "The Godfather","director": "Francis Ford Coppola","year": 1972,"genres":["thriller","mobster","dance-parody"]}

This string is just read in as Text and then collected by the OutputCollector.

Do you have any examples of what I'm trying to achieve?

costin commented Mar 27, 2014

That string looks valid, but you need to be sure that's actually the data being passed out (you can log it before setting the Text).
The docs probably need updating to include an example for MR; in the meantime, take a look at the tests:
https://github.com/elasticsearch/elasticsearch-hadoop/tree/master/src/test/java/org/elasticsearch/hadoop/integration/mr

whitfin commented Mar 27, 2014

@costin thanks for the link - I'll take a look.

Here is a snippet of my code - does anything look incorrect?

public static class MapHandler extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output) throws IOException {
        Map<String, Object> json = mapper.readValue(value.toString(), Map.class);
        Text name = new Text(json.get("name").toString());
        output.collect(name, new Text(value));
    }
}

public static class ReduceHandler extends Reducer<Text, Iterable<Text>, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, OutputCollector<Text, Text> output) throws IOException {
        for (Text value : values) {
            output.collect(key, value);
        }
    }
}

I do have conf.set("es.input.json", "yes"); inside my main, too.

costin commented Mar 27, 2014

Looks good, but make sure your Reducer is actually used - unless you do some aggregation or something like that, the majority of jobs are mapper-only, so make sure that part looks okay.
From what I can tell your setup is wrong, since you are writing a Map to the output - for writing Writables to Elasticsearch, get rid of "es.input.json", since you are not passing JSON to Elasticsearch.

whitfin commented Mar 27, 2014

@costin Ah, does es.input.json only apply when passing a Map<Writable, Writable>?

I tried that method and hit the same issue.

costin commented Mar 27, 2014

I suggest re-reading http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/mapreduce.html#_writing_data_to_elasticsearch

es.input.json does NOT work with Map<Writable, Writable>.
Your reducer doesn't do anything, so you can just eliminate it (or use the identity one). Also, the issue seems to be that you are wrapping the Text within another Text - no need, just pass the original Text and that's it.
Also, EsOutputFormat ignores the key, so you can pass null or NullWritable.
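Putting that together - a minimal, untested sketch (the class and job names are placeholders; es.resource points at the latest/event index/type from your log):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

public class JsonToEs {
    // Mapper-only: pass each raw JSON line straight through, no re-wrapping.
    public static class JsonMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // EsOutputFormat ignores the key, so NullWritable is fine here.
            context.write(NullWritable.get(), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("es.resource", "latest/event"); // index/type from the log above
        conf.set("es.input.json", "yes");        // the Text payload is already JSON

        Job job = new Job(conf, "json-to-es");
        job.setJarByClass(JsonToEs.class);
        job.setMapperClass(JsonMapper.class);
        job.setOutputFormatClass(EsOutputFormat.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setNumReduceTasks(0); // drop the reducer until the basics work
        // input path setup (e.g. FileInputFormat.addInputPath) omitted
        job.waitForCompletion(true);
    }
}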

whitfin commented Mar 27, 2014

@costin okay, that's what I thought (sorry for the confusion there).

The Reducer is required here because once I get the JSON passed to ES correctly, I'll be expanding on it. Am I right in thinking that the Reducer will receive the JSON payloads grouped by the same name?

Also, I saw that the _bulk API requires a \n, and it looks like es-hadoop uses this - could this be part of it?

costin commented Mar 27, 2014

My suggestion is to eliminate any code that is not needed and get to a phase where things are running (see the tests). This means having a basic mapper which sends the full raw JSON to the output and then on to ES.
Feel free to add the \n; however, I would investigate why you keep getting the extra \\\.

whitfin commented Mar 27, 2014

@costin thanks for the suggestions - I no longer get \\\ with the basic string above.

I'll keep trying and update if anything changes.

whitfin commented Mar 27, 2014

@costin how would you suggest I log output during my mapping phase? It doesn't seem to show up in the Hadoop logs.

costin commented Mar 27, 2014

There are multiple options depending on your environment. System.out works for both local mode and the distributed/remote one.
You can also use log4j:
https://www.google.com/search?q=hadoop+log4j

You could take a step back and create a simple Hadoop job that dumps the data to HDFS so you can analyze it afterwards, if that's easier.
Once you can verify the JSON is valid, you can move on to ES.
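For instance, from inside the mapper (a trivial, untested fragment - in distributed mode these lines end up in the per-task logs, e.g. the stdout/syslog files in the task web UI, not on the client console):

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

private static final Log LOG = LogFactory.getLog(MapHandler.class);

// inside map(), right before collecting:
LOG.info("emitting JSON: " + value);           // goes to the task syslog
System.out.println("emitting JSON: " + value); // goes to the task stdout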

So far we've been 'remote debug guessing' mistakes in the Hadoop job rather than an actual issue in es-hadoop, so I'm inclined to close this down...


whitfin commented Mar 28, 2014

@costin that's a fair point, although originally it did look to me like an es-hadoop issue. I'll close, and reopen if there's more evidence of it being es-hadoop.

Thanks for your help.

whitfin closed this Mar 28, 2014

costin commented Mar 28, 2014

Thanks. Feel free to reach out on the mailing list or IRC.
