Out of nodes #188

Closed
Foolius opened this Issue Apr 11, 2014 · 25 comments

@Foolius commented Apr 11, 2014

Me again with another, probably silly, question.
I'm trying to index many files that are already in HDFS.
Indexing single files is no problem; now I'm passing all the files that should be indexed at once through the command line, using the script below.
For a few minutes it works fine, until it gives this error:

ERROR 2997: Encountered IOException. Out of nodes and retries; caught exception


-- register the es-hadoop connector jar
register /home/kolbe/elasticsearch-hadoop-1.3.0.M2/dist/elasticsearch-hadoop-1.3.0.M2-yarn.jar
-- load each input file as one JSON document per line
a = load '$files' using PigStorage() as (json:chararray);
-- write the raw JSON documents into the '$index' index
store a into '$index' using org.elasticsearch.hadoop.pig.EsStorage('es.input.json=true','es.nodes=<address>:9200');
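
For context, the $files and $index parameters above would be supplied on the Pig command line, roughly like this (the paths, index name, and script name are illustrative, not from the original report):

pig -param files='/data/tweets/part-*.json' -param index='tweets/tweet' index_tweets.pig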
@costin (Member) commented Apr 11, 2014

@Foolius
First, please upgrade to the latest published release, 1.3.0.M3 - it contains a lot of improvements and bug fixes - see this blog entry.
If the problem persists, please turn on logging and report back.

Ideally, if you could upload some sample data somewhere (a gist, whatever you have), that would be great.

P.S. The exception indicates a network error - why it occurs after several minutes, I'm not sure...

@costin costin added the bug label Apr 11, 2014

@Foolius (Author) commented Apr 12, 2014

After updating to the newest version, I get the following error:

ERROR 2998: Unhandled internal error. Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected

And I'm registering this file:

/home/foo/elasticsearch-hadoop-1.3.0.M3/dist/elasticsearch-hadoop-pig-1.3.0.M3.jar

This is the right one, right?

And thank you very much for your patience.

@costin (Member) commented Apr 12, 2014

Yes, the jar is correct; however, I'm puzzled by the exception. Can you post the entire stack trace somewhere, along with your script?

Thanks!

@Foolius (Author) commented Apr 12, 2014

@costin (Member) commented Apr 12, 2014

Can you confirm the Pig and Hadoop versions used?

@Foolius (Author) commented Apr 12, 2014

2014-04-12 21:41:49,602 [main] INFO  org.apache.pig.Main - Apache Pig version 0.11.0-cdh4.3.0 (rexported) compiled May 27 2013, 20:48:21

Hadoop 2.0.0-cdh4.3.0
@costin (Member) commented Apr 12, 2014

Looks like you bumped into a bug - sorry about that. I've pushed a SNAPSHOT containing a fix - can you please try it out and report back? Thanks!

@Foolius (Author) commented Apr 12, 2014

It has been running for about 25 minutes now, so I think it works.

Again thank you very much for your help.

@costin (Member) commented Apr 12, 2014

Great! If you have some logs (you should get some from Pig) or stats, let me know. I'd be interested to hear how much data you have pushed through and how es-hadoop/ES is behaving.
If you check the docs, you'll see that es-hadoop provides stats (accessible through the API or in the console). I haven't tried enabling verbosity on Pig to get the stats exposed, but it's worth trying.

cheers,

@costin costin closed this Apr 12, 2014

@Foolius (Author) commented Apr 13, 2014

After running for about 4 hours, the job failed again with the same error message, "out of nodes and retries".
Now I want to run it with logging enabled like you said, but I don't know exactly what I have to do; can you tell me the Pig command for this?

@costin (Member) commented Apr 13, 2014

@costin (Member) commented Apr 14, 2014

@Foolius One thing to add: you can enable logging on the Pig module (see the docs I've attached), which will tell you, during the job run, whether a connection to a node fails, why, and which node is selected next.
I'm not sure what the issue is in your case; it looks like some connection error, but I'm unclear whether it occurs in ES or Hadoop. I recommend checking the logs on the ES side as well to see whether a timeout is occurring.
Additionally, try using a smaller bulk size (a sketch follows below) - start with a smaller data set, and especially make sure you know/control the number of tasks hitting ES, since they might overload it. This shouldn't be a problem, but without knowing what's going on, it's something to keep an eye on.
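
As a rough sketch, the bulk size is controlled through the es.batch.size.* settings passed to EsStorage; the values below are arbitrary starting points, so check the es-hadoop configuration docs for the exact names and defaults:

-- smaller bulk requests: at most 500 docs or ~1 MB per bulk call (illustrative values)
store a into '$index' using org.elasticsearch.hadoop.pig.EsStorage(
    'es.input.json=true', 'es.nodes=<address>:9200',
    'es.batch.size.entries=500', 'es.batch.size.bytes=1mb');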

@costin costin reopened this Apr 14, 2014

@costin (Member) commented Apr 14, 2014

Minor correction: for network-related issues, enable logging on the org.elasticsearch.hadoop.rest module. Note that any connection problems are reported as ERRORs to the console anyway.
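
With a log4j-based setup (which Pig typically uses), enabling that would mean a line along these lines in the log4j.properties picked up by the job - a sketch; pick the level you need:

log4j.logger.org.elasticsearch.hadoop.rest=TRACE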

@Foolius (Author) commented Apr 16, 2014

OK, so I'm a bit overwhelmed by the many different places and configurations for the logs. Additionally, I'm starting to get different error messages, which are perhaps problems on my side. One says that the JSON parsing fails because the JSON is not correct; then I had errors that I think were caused by some shards being lost.
Another error outputs this: https://gist.github.com/Foolius/10840474
I think this one is the most interesting for you.

@costin (Member) commented Apr 16, 2014

Yeah, debugging Hadoop isn't easy and unfortunately es-hadoop (like every other piece of code running on top of it) relies on the same infrastructure. I see the nodes are failing one by one. One thing that stood out was the port, 9201 - do you set this yourself? Typically it should be 9200.
Have you disabled HTTP by any chance? Is ES accessible from the Hadoop cluster? Can you check this from the command line - for example, trigger a curl command from one of the nodes using the IPs/ports in the logs, as below?
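
Such a check might look like this, run from one of the Hadoop nodes (the address and port are placeholders for the values in the failing log entries):

curl -v http://<address>:9201/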

I plan to add a bit more debugging to master - such as including the underlying exception - to provide some more context (though in many cases it's fairly generic).

@Foolius (Author) commented Apr 16, 2014

The master node can access the nodes and the nodes can access each other, at least when the indexing is not running. Perhaps the network gets flooded and then it can't connect?
I got another error message: https://gist.github.com/Foolius/10879114
But this one doesn't make the whole job fail. For example, right now I'm running this with 32 files, an index with 32 shards, and 8 nodes, so it starts 32 map tasks; 2 of them got this exception.

@costin (Member) commented Apr 16, 2014

It looks like some of your files are invalid - the parsing error in this case indicates that ES received an invalid JSON document. I'm not sure what type of network you are running, but I highly doubt it gets flooded to the point where a basic HTTP connection times out.

Note that es-hadoop does not and cannot limit the number of tasks writing to ES - only the reads. If the files are splittable, you might have more than 32 tasks running concurrently.
Is anything happening on the ES front? What version are you using? Try using Marvel to see how the job is doing during ingestion (install sketch below).
Maybe a GC is kicking in, though that should affect only one node, not all of them at once. The timeout is about 1m, so with the fallback it looks like the cluster goes dark for about 6-7m, which suggests something else.
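
For reference, Marvel at the time was installed as an ES plugin; assuming the standard ES 1.x plugin script, something like the following (check the Marvel docs for the exact command for your version):

./bin/plugin -i elasticsearch/marvel/latest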

@Foolius (Author) commented Apr 16, 2014

I will have to see if I can install Marvel.

The files are from the official Twitter API, but sometimes they contain lines like "Connection established." or "Receiving status stream.". I thought ES would still index these without complaining.

After some experimentation, I think this error (https://gist.github.com/Foolius/10879114) is significant. Now, even when I index a single file, this exception is raised. And when I ran the job again with the configuration I described above, 97 map tasks failed with this exception within a few seconds and the whole job failed.

@costin (Member) commented Apr 16, 2014

Right now the default behaviour is to fail the entire job if any type of unrecoverable exception occurs.
What happens in your case is that there's at least one invalid entry in your input, which causes the job to fail. There's work under way to allow more lenient behaviour, but until that happens, take a look at your files and try some basic JSON validation (there are various tools that can do that; a short sketch follows this comment).
Note that the ES exception indicates the invalid bytes received, which you can convert into chars and use to search for the offending entry.

Hope this helps,
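
Since the input here is one JSON document per line, a quick validation pass could be scripted with Python's standard json module - a sketch, with the input file passed as an argument:

# validate.py - report input lines that are not valid JSON (one document per line)
import json
import sys

with open(sys.argv[1]) as f:
    for lineno, line in enumerate(f, 1):
        if not line.strip():
            continue  # skip blank lines
        try:
            json.loads(line)
        except ValueError:
            # catches stray status lines such as "Stream closed."
            print("line %d is not valid JSON: %r" % (lineno, line[:80]))

Run as python validate.py <file> against each input before feeding it to the job.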

@Foolius (Author) commented Apr 16, 2014

Yes, that helps.

I will see how to fix this.

Thank you very much for your patience.

Edit: I forgot: I tried another file, which probably has valid JSON in it, and right now it still runs :)

costin added a commit that referenced this issue Apr 16, 2014

@costin (Member) commented Apr 16, 2014

Hi,

I've added some minor tweaks in master which increase the amount of info given in the exception. For each host that fails, you now also get some information about the error message. Additionally, for parsing errors from ES, one also gets the String representation of the byte array. For example, your error indicates that the string 'Stream closed.' was actually sent to ES.

Can you please try out master and post a gist with the logs?

P.S. I've already pushed the SNAPSHOTs to Maven, so you can simply get 1.3.0.BUILD-SNAPSHOT.

Thanks,


Costin

@Foolius (Author) commented Apr 17, 2014

Hi,
I used elasticsearch-hadoop-1.3.0.BUILD-20140416.180837-387.jar
and got this error message:
https://gist.github.com/Foolius/10973879
It looks the same to me - did I use the wrong build?

But I can solve this issue easily by using the elephant-bird library.

@costin (Member) commented Apr 17, 2014

Hi - there was an issue with the published update; I've pushed a fix (the jar name is elasticsearch-hadoop-1.3.0.BUILD-20140417.223030-390.jar).

Can you please try it out again?

Thanks,


Costin

@Foolius (Author) commented Apr 25, 2014

I still get the same error message with the jar you mentioned.

@costin (Member) commented Apr 25, 2014

@Foolius I recommend trying the latest dev snapshot and reporting back the error. The error message, if there still is one, should now be properly extracted. It would also be useful to understand what the problem actually is and potentially create a new issue (as this is no longer about 'out of nodes' but rather, by the looks of it, about some invalid JSON).

Cheers,

@costin costin closed this May 8, 2014
