Replication fails with indeterminate error in basic configuration #3145

Closed
eonnen opened this Issue Jun 6, 2013 · 27 comments


eonnen commented Jun 6, 2013

Officially filing what I've seen here: https://groups.google.com/forum/#!topic/elasticsearch/LPv_wPPVTJg

In short, I'm attempting to configure a basic 2-node cluster with ES 0.90.0 on JRE 1.7.0_17, on dedicated hardware running slightly different Linux distros. The baseline configuration isn't complicated: one index with 16 shards and predefined mappings. Regardless of discovery mechanism, the nodes seem to discover each other and elect a master fine. If I run a single node and load it with the exact same harness, everything works fine. With two nodes, replication errors begin to happen, but I'm not able to tell what the underlying errors are because the Netty transport layer appears unable to deserialize the serialized exception. Stack traces look similar to:

[2013-06-05 09:35:20,480][WARN ][action.index             ] [es-1] Failed to perform index on replica [rules][4]
org.elasticsearch.transport.RemoteTransportException: Failed to deserialize exception response from stream
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize exception response from stream
    at org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:171)
    at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:125)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:107)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.StreamCorruptedException: unexpected end of block data
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1369)
    at java.io.ObjectInputStream.access$300(ObjectInputStream.java:205)
    at java.io.ObjectInputStream$GetFieldImpl.readFields(ObjectInputStream.java:2132)
    at java.io.ObjectInputStream.readFields(ObjectInputStream.java:537)
    at java.net.InetSocketAddress.readObject(InetSocketAddress.java:282)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1004)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1872)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1894)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1894)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
    at org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:169)
    ... 23 more
[2013-06-05 09:35:20,483][WARN ][cluster.action.shard     ] [es-1] sending failed shard for [rules][4], node[Ed7LBQdzQMae69IzlCm-Dg], [R], s[STARTED], reason [Failed to perform [index] on replica, message [RemoteTransportException[Failed to deserialize exception response from stream]; nested: TransportSerializationException[Failed to deserialize exception response from stream]; nested: StreamCorruptedException[unexpected end of block data]; ]]
[2013-06-05 09:35:20,483][WARN ][cluster.action.shard     ] [es-1] received shard failed for [rules][4], node[Ed7LBQdzQMae69IzlCm-Dg], [R], s[STARTED], reason [Failed to perform [index] on replica, message [RemoteTransportException[Failed to deserialize exception response from stream]; nested: TransportSerializationException[Failed to deserialize exception response from stream]; nested: StreamCorruptedException[unexpected end of block data]; ]]

As additional context, I originally started with compression enabled and then disabled it. I've tested with both ZooKeeper and Zen discovery (unicast and multicast) with the same results. I've deleted the index and also started completely fresh, with no change.

I'll follow up with config files after I sanitize them. Haven't made any progress asking on the mailing list.

eonnen commented Jun 6, 2013

One additional data point I've discovered: this issue is only reproducible when generating concurrent load with multiple clients. If I drop my clients down to a single thread, there are no problems. When I step up to two threads executing index calls (using TransportClient), I can reproduce the issue.
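For reference, a minimal sketch of that kind of multi-threaded indexing harness, assuming the 0.90-era Java TransportClient API; the cluster name, node hostnames, index/type names, and document body are placeholders:

    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.settings.ImmutableSettings;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ConcurrentIndexHarness {
        public static void main(String[] args) {
            // Connect to both data nodes over the transport protocol (port 9300).
            final TransportClient client = new TransportClient(ImmutableSettings.settingsBuilder()
                    .put("cluster.name", "my-cluster") // placeholder
                    .build())
                    .addTransportAddress(new InetSocketTransportAddress("es-1", 9300))
                    .addTransportAddress(new InetSocketTransportAddress("es-2", 9300));

            // Two indexer threads are reportedly enough to trigger the failure; one is not.
            ExecutorService pool = Executors.newFixedThreadPool(2);
            for (int t = 0; t < 2; t++) {
                pool.submit(new Runnable() {
                    @Override
                    public void run() {
                        for (int i = 0; i < 10000; i++) {
                            client.prepareIndex("rules", "rule")
                                    .setSource("{\"name\":\"doc\",\"location\":\"52.37,4.89\"}")
                                    .execute()
                                    .actionGet();
                        }
                    }
                });
            }
            pool.shutdown();
        }
    }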

Contributor

s1monw commented Jun 7, 2013

Hi Eric,

This is interesting. Is it possible to reproduce this error with a specific document you could share, or does the document not matter at all here? As far as I understand, you are indexing into 2 nodes with the TransportClient, and as soon as you switch to concurrent requests you see the problems happening, right? Also, do you get exceptions in your client as well, or do you see failing index requests? Is there any chance that your client uses a different ES version?

Contributor

jprante commented Jun 7, 2013

Duplicate of an older thread (no solution yet): https://groups.google.com/d/msg/elasticsearch/MSrKvfgKwy0/Tfk6nhlqYxYJ

eonnen commented Jun 7, 2013

Hey Simon! You're correct, I see no exceptions at the client layer and all is fine. I'm indexing a stream of very similar documents and when I only run a single indexer thread, I encounter no problems. As soon as I change to multiple threads on the client I can reproduce the issue. Given that I have no problems single threaded, I'm fairly certain this isn't a document-specific issue. I can't really pinpoint a specific problematic document as the exception doesn't make it back to the client.

eonnen commented Jun 7, 2013

Oh, and yes, I am using the same client version as the server, assuming the Maven version of the client is correct.

eonnen commented Jun 7, 2013

With further testing, the issue does not seem to present itself when I disable the geo_point field in my mappings.
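For anyone trying to reproduce this, a sketch of creating an index with a geo_point field via the Java API; the index, type, and field names are placeholders, and the admin-API method names are assumed from the 0.90-era client:

    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.settings.ImmutableSettings;
    import org.elasticsearch.common.xcontent.XContentBuilder;

    import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

    public class CreateIndexWithGeoPoint {
        public static void createIndex(Client client) throws Exception {
            // Mapping with a geo_point field; removing this field makes the problem disappear.
            XContentBuilder mapping = jsonBuilder()
                    .startObject()
                        .startObject("rule")
                            .startObject("properties")
                                .startObject("location")
                                    .field("type", "geo_point")
                                .endObject()
                                .startObject("name")
                                    .field("type", "string")
                                .endObject()
                            .endObject()
                        .endObject()
                    .endObject();

            client.admin().indices().prepareCreate("rules")
                    .setSettings(ImmutableSettings.settingsBuilder()
                            .put("number_of_shards", 16)
                            .put("number_of_replicas", 1)
                            .build())
                    .addMapping("rule", mapping)
                    .execute()
                    .actionGet();
        }
    }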

Contributor

s1monw commented Jun 7, 2013

man, that is good input! I will check this soonish! thanks man... will report back once I have something

Contributor

jprante commented Jun 10, 2013

A common cause of this kind of error is a server-side exception generated by an optional jar that is not present on the client side. Geo uses the optional spatial4j jar, so my recommendation is to add the spatial4j jar to the client classpath.
This also holds for all kinds of plugins: if plugin jars are not present on the client classpath and throw exceptions, those exceptions may fail to decode on the client.
Another option is to add more logging on the server side so it is easier to track runtime exceptions in the server logs.
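If it's useful, a quick way to check whether an optional jar like spatial4j is actually visible on the client's classpath is to probe for one of its classes; the class name below is assumed from the spatial4j 0.3/0.4 releases (later releases moved the package):

    public class ClasspathCheck {
        public static void main(String[] args) {
            // Probe for a class from the optional spatial4j jar.
            // Package name assumed from spatial4j 0.3/0.4; newer releases use org.locationtech.spatial4j.
            try {
                Class.forName("com.spatial4j.core.shape.Shape");
                System.out.println("spatial4j is on the classpath");
            } catch (ClassNotFoundException e) {
                System.out.println("spatial4j is NOT on the classpath");
            }
        }
    }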

eonnen commented Jun 11, 2013

@jprante, doesn't the TransportClient simply submit a JSON representation of a document with nested fields? I'm not sure I understand why the client needs the ability to create a geospatial representation of what is effectively a text document. Regardless, the client's classpath was that of the POM, which includes lucene-spatial and, as a dependency, spatial4j.

Contributor

s1monw commented Jun 11, 2013

@eonnen I had no luck reproducing this one yesterday, but I will give it another go tomorrow. Note that the TransportClient still talks the binary transport protocol, not JSON over HTTP.

Contributor

s1monw commented Jun 13, 2013

@eonnen I used 0.90 and started 2 nodes on localhost:9200 and localhost:9201 (transport 9300 / 9301 respectively). Then I ran https://gist.github.com/s1monw/5774329 and I don't see any issues. Can you try to run this against your setup, or extend it so that it reproduces?

simon

tkurki commented Oct 15, 2013

I am seeing something similar here, but my case is really different: client throws

Caused by: java.io.EOFException
    at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2597)
    at java.io.ObjectInputStream.skipCustomData(ObjectInputStream.java:1942)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1916)
...
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:301)

when doing a simple search.

Digging strings out of the buffer data just before it is read with ThrowableObjectInputStream shows stuff like

4org.elasticsearch.transport.RemoteTransportExceptionxr
4org.elasticsearch.transport.ActionTransportExceptionxr
.org.elasticsearch.transport.TransportExceptionxr
(org.elasticsearch.ElasticSearchExceptionxr
java.lang.RuntimeExceptionxr
java.lang.Throwablexpsr
/org.elasticsearch.indices.IndexMissingExceptionxr
&org.elasticsearch.index.IndexExceptionxq

so in my case it looks like a simple case of a missing index, but reporting the error is what breaks.

Only it isn't a missing index, because the index is there and can be searched via the HTTP/JSON API...

I have previously been burned by returning serialized Exceptions over the wire and having the client fail because of some incompatibility between server- and client-side classpaths. I think the prudent way of handling server-side errors is to return well-defined, string-based error objects and to log stack traces on the server side. Otherwise this type of "sorry, we failed while failing" can crop up.

Haven't looked at the code yet though, it is possible I'm barking up the wrong tree.

fizx commented Oct 31, 2013

This error can easily be duplicated by starting a cluster with some nodes running JRE 1.7.0_17, and other nodes running JRE 1.7.0_45.

Member

kimchy commented Oct 31, 2013

This happens because, sadly, Java broke backward compatibility between 1.7.0_17 and 1.7.0_2x in how InetAddress gets serialized over the network. The only place in ES where we use Java serialization is when we serialize exceptions (since we don't want to write custom serializers for each one), and we use exceptions for state management (i.e. trying to perform an operation on a node that is not ready). Sadly, part of those exceptions is an InetAddress, which causes this serialization problem...

There is not much we can do with current versions, except to ask that you make sure not to run mixed <=1.7.0_17 and >=1.7.0_2x versions in the same cluster.
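To make the failure mode concrete, here is a minimal, self-contained sketch of the serialization path in question: an exception whose fields include an InetSocketAddress is written with plain Java serialization on one side and read back on the other. Within a single JVM this round-trips fine; the breakage appears when the writer and reader run JDK patch levels that encode InetAddress differently. The exception class here is a stand-in for illustration, not the one ES actually uses:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.net.InetSocketAddress;

    public class ExceptionSerializationDemo {

        // Stand-in for a transport exception that carries the remote node's address.
        static class NodeNotReadyException extends RuntimeException {
            final InetSocketAddress remoteAddress;

            NodeNotReadyException(InetSocketAddress remoteAddress) {
                super("node not ready: " + remoteAddress);
                this.remoteAddress = remoteAddress;
            }
        }

        public static void main(String[] args) throws Exception {
            NodeNotReadyException original =
                    new NodeNotReadyException(new InetSocketAddress("es-2", 9300));

            // "Sender" side: plain Java serialization, as used for exception responses.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(original);
            out.close();

            // "Receiver" side: works here because both ends are the same JVM. Across
            // mismatched patch levels (e.g. 1.7.0_17 writing, 1.7.0_25 reading) the
            // InetSocketAddress field can fail with a StreamCorruptedException, as in the logs above.
            ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            NodeNotReadyException copy = (NodeNotReadyException) in.readObject();
            in.close();
            System.out.println("round-tripped: " + copy.getMessage());
        }
    }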

JacekLach commented Nov 21, 2013

FYI: this also happens on Java 6, between 1.6.0_43 and 1.6.0_45.

Member

kimchy commented Nov 29, 2013

grrr, so annoying..., though we don't really have control over it, apologies....

Contributor

josegonzalez commented Jan 27, 2014

I just ran into this issue. Shards are constantly reallocating themselves and failing :(

Member

kimchy commented Jan 27, 2014

@josegonzalez make sure you run the same Java version across nodes?

Contributor

josegonzalez commented Jan 27, 2014

Yeah I fixed it. Here's what happened in my case:

We brought up a new node for our API Elasticsearch cluster, and this new node had Java 1.7.0_51. The old nodes ran 1.7.0_16 (I believe). I ended up updating Java everywhere.

Elasticsearch is started once Java is installed, and installing or upgrading Java does not automatically restart Elasticsearch. So we ended up running two different JVM versions in the cluster, even though java -version produced 1.7.0_51 everywhere.

I had to restart all older nodes in the cluster, which then allowed replication to proceed as advertised.

Java should really have a sticker on it "please don't touch me, ok thx bye" :)

Might be worth mentioning this in the docs somewhere, maybe an FAQ.

matiwinnetou commented Feb 3, 2014

Would it make sense to replace standard Java Serialization with another serialization protocol?

Member

kimchy commented Feb 3, 2014

@mati1979 we only use standard Java serialization for serializing exceptions; for most cases, where the request/response is not exception-driven, we use our own custom serialization.

aphyr commented Apr 12, 2014

Chiming in that this also breaks clients that use different JVM patch levels, and I imagine it makes doing any kind of rolling upgrade something of a headache. Might consider serializing exceptions as a [classname-str message-str [stacktrace-line stacktrace-line ...]] structure, under the premise that some information is better than no information about the underlying error. If you're relying on try/catch dispatch for Elasticsearch exception types, you can always define a custom deserializer for those; it's only the underlying exceptions you don't control that need a fallback mechanism.
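A rough sketch of the kind of structure being suggested, as a plain data object that any receiver could decode regardless of JVM version; the class and field names are illustrative, not an ES API:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: a version-independent representation of an error, roughly the
    // [classname-str message-str [stacktrace-line ...]] shape suggested above.
    public class WireError {
        public final String className;
        public final String message;
        public final List<String> stackTrace;

        public WireError(String className, String message, List<String> stackTrace) {
            this.className = className;
            this.message = message;
            this.stackTrace = stackTrace;
        }

        // Flatten any Throwable, including ones whose class the receiver does not have.
        public static WireError from(Throwable t) {
            List<String> frames = new ArrayList<String>();
            for (StackTraceElement e : t.getStackTrace()) {
                frames.add(e.toString());
            }
            return new WireError(t.getClass().getName(), t.getMessage(), frames);
        }
    }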

Member

kimchy commented Apr 12, 2014

There is a thread on the mailing list about it. This is the first time Java has broken serialization for exceptions in the 4.5 years the codebase has existed. While annoying, the question is whether we should now write custom logic to serialize exceptions.

Exceptions hold more than just what you described, since they also hold variables and the class type itself, so your solution won't work as-is. One option is to use something like the Jackson object mapper to serialize exceptions in a generic fashion. If we see this pattern repeat in Java (at least one more time), then we will do it (which comes with other costs, obviously; you still rely on a serialization format).

Side note: you are running your servers on 1.7.0_03; make sure you don't do that, as it is known to be a problematic Java version (in many different aspects). Use _25.

aphyr commented Apr 12, 2014

I am not really an expert in these sorts of things, and of course defer to your judgement and expertise, but might I suggest defining a taxonomy of the errors that occur in your system, and having a formally specified, platform-independent representation of those errors on the wire, such that clients running on other JVMs or even clients from other runtimes could interpret those errors in a common fashion?

Member

kimchy commented Apr 12, 2014

That would be nice, and something that potentially will be worth the investment to do.

The taxonomy of external errors today for external clients follows the REST status codes, as most of our clients talk to ES through HTTP. In JVM-based languages using the Java client, each error can be mapped to an HTTP response code.

For internal errors, or node (client) to node errors, the idea behind using Java serialization for exceptions is to try to preserve the full information of the exception and propagate it through the chain of nodes that handles the request. Those exceptions often come from low-level operations like the network or file system. This allows us, for example, to give detailed information on why something failed, not necessarily on the node where it actually failed (this helps a lot when it comes to debugging).

For example, if a shard failed, the master node will log that information without needing to go and chase where that shard failed to try to understand why it failed.

Last, the other aspect that makes it super hard is the fact that Java doesn't have a good taxonomy of exceptions. Just look at IOException :). At least in that case, by not losing the type and info, we can potentially do better, not only where the actual type is around...

ppuschmann commented Apr 30, 2014

Hi, I also ran into the same issue when applying security fixes to one server (1.6.0_18 vs. 1.6.0_31) in an ES 0.19.12 cluster (with only HTTP clients).

Very interesting to see that the one node with the differing (higher) version number had no entries in its log files, while the other nodes that wanted to replicate shards to it had a lot of exceptions.

Is there a plan for how to get this fixed? And how should we go about updating the JRE?

Member

clintongormley commented Sep 19, 2015

This has been implemented in #11910
