
Getting "Failed to deserialize response from handler" after upgrading from 5.6.4 to 6.2.1 #28713

Closed
bra-fsn opened this issue Feb 17, 2018 · 7 comments
Labels
>bug :Data Management/Stats Statistics tracking and retrieval APIs

Comments

@bra-fsn

bra-fsn commented Feb 17, 2018

Elasticsearch version (bin/elasticsearch --version):
Version: 6.2.1, Build: 7299dc3/2018-02-07T19:34:26.990113Z, JVM: 1.8.0_131

Plugins installed: analysis-icu

JVM version (java -version):
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)

OS version (uname -a if on a Unix-like system):
FreeBSD 11.1

Description of the problem including expected versus actual behavior:
After upgrading from 5.6.4 to 6.2.1 I get these transport errors between nodes (in this case between the master (10.6.145.102) and a data node (10.6.145.235)).

Steps to reproduce:
We never saw this on previous Elasticsearch versions; it started to appear after the 6.2.1 upgrade. We have several nodes (three distinct masters, 40 mixed data/client nodes, and some client-only nodes), and the Java versions differ between them (masters: 1.8.0_51, data nodes: 1.8.0_144, client nodes: 1.8.0_131).
We have indices created with various 5.x versions, holding a lot of different data.
Please tell me what exact information I should provide.

Provide logs (if relevant):
[2018-02-17T09:45:45,474][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [fmesm02] failed to execute on node [VuFL_GfOQau3I-ZYHbiW5g]
org.elasticsearch.transport.RemoteTransportException: [Failed to deserialize response from handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler]]
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize response from handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler]
    at org.elasticsearch.transport.TcpTransport.handleResponse(TcpTransport.java:1441) [elasticsearch-6.2.1.jar:6.2.1]
    at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1400) [elasticsearch-6.2.1.jar:6.2.1]
    at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:64) [transport-netty4-6.2.1.jar:6.2.1]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.16.Final.jar:4.1.16.Final]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.io.IOException: Invalid string; unexpected character: 249 hex: f9
    at org.elasticsearch.common.io.stream.StreamInput.readString(StreamInput.java:375) ~[elasticsearch-6.2.1.jar:6.2.1]
    at org.elasticsearch.node.ResponseCollectorService$ComputedNodeStats.<init>(ResponseCollectorService.java:145) ~[elasticsearch-6.2.1.jar:6.2.1]
    at org.elasticsearch.common.io.stream.StreamInput.readMap(StreamInput.java:460) ~[elasticsearch-6.2.1.jar:6.2.1]
    at org.elasticsearch.node.AdaptiveSelectionStats.<init>(AdaptiveSelectionStats.java:56) ~[elasticsearch-6.2.1.jar:6.2.1]
    at org.elasticsearch.common.io.stream.StreamInput.readOptionalWriteable(StreamInput.java:733) ~[elasticsearch-6.2.1.jar:6.2.1]
    at org.elasticsearch.action.admin.cluster.node.stats.NodeStats.readFrom(NodeStats.java:239) ~[elasticsearch-6.2.1.jar:6.2.1]
    at org.elasticsearch.transport.TransportResponseHandler.read(TransportResponseHandler.java:47) ~[elasticsearch-6.2.1.jar:6.2.1]
    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.read(TransportService.java:1085) ~[elasticsearch-6.2.1.jar:6.2.1]
    at org.elasticsearch.transport.TcpTransport.handleResponse(TcpTransport.java:1437) ~[elasticsearch-6.2.1.jar:6.2.1]
    at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1400) [elasticsearch-6.2.1.jar:6.2.1]
    at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:64) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) ~[?:?]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) ~[?:?]
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413) ~[?:?]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
    at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) ~[?:?]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) ~[?:?]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_131]
[2018-02-17T09:45:45,487][WARN ][o.e.t.n.Netty4Transport ] [fmesm02] exception caught on transport layer [NettyTcpChannel{localAddress=/10.6.145.102:13476, remoteAddress=10.6.145.235/10.6.145.235:9300}], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [3185238], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1@3c93bec2], error [false]; resetting
    at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1407) ~[elasticsearch-6.2.1.jar:6.2.1]
    at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:64) ~[transport-netty4-6.2.1.jar:6.2.1]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.16.Final.jar:4.1.16.Final]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
[2018-02-17T09:45:45,548][INFO ][o.e.c.r.a.AllocationService] [fmesm02] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{fmfe30}{VuFL_GfOQau3I-ZYHbiW5g}{VGhsT_tCRpCngX3cTFSZ0w}{10.6.145.235}{10.6.145.235:9300} transport disconnected]).

@bra-fsn
Author

bra-fsn commented Feb 17, 2018

The unexpected character varies wildly:
Caused by: java.io.IOException: Invalid string; unexpected character: 146 hex: 92
Caused by: java.io.IOException: Invalid string; unexpected character: 159 hex: 9f
Caused by: java.io.IOException: Invalid string; unexpected character: 242 hex: f2
Caused by: java.io.IOException: Invalid string; unexpected character: 137 hex: 89
Caused by: java.io.IOException: Invalid string; unexpected character: 186 hex: ba
Caused by: java.io.IOException: Invalid string; unexpected character: 251 hex: fb
Caused by: java.io.IOException: Invalid string; unexpected character: 249 hex: f9
Caused by: java.io.IOException: Invalid string; unexpected character: 189 hex: bd
Caused by: java.io.IOException: Invalid string; unexpected character: 153 hex: 99
Caused by: java.io.IOException: Invalid string; unexpected character: 254 hex: fe
Caused by: java.io.IOException: Invalid string; unexpected character: 139 hex: 8b

@ywelsch
Contributor

ywelsch commented Feb 17, 2018

Are all nodes on 6.2.1? Is this issue only occurring between certain nodes? Is the response different if you fire off a GET /_nodes/stats request on the master-eligible, the data, or the client nodes? Are all nodes but one reporting a failure in that case, or only a subset of the nodes?
Before we get into tcpdump territory, can you maybe reproduce this in a simple two-node cluster (same hardware/OS)?
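(A quick way to compare entry points — hostnames here are placeholders for your actual nodes — is to send the same request to each node type and check the failed count in the _nodes header of the response:

for h in a-master-node a-data-node a-client-node; do
  echo -n "$h: "; curl -s "http://$h:9200/_nodes/stats" | head -c 120; echo
done

The leading "_nodes":{"total":N,"successful":M,"failed":K} object shows whether failures depend on which node handles the request.)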

@bra-fsn
Author

bra-fsn commented Feb 17, 2018

Yes, all nodes are 6.2.1:
# curl -s 'http://localhost:9200/_cat/nodes?h=node.role,version' | sort | uniq -c
  40 di 6.2.1
   9 i 6.2.1
   3 mi 6.2.1
They only differ in their roles and JDK versions.

I can only see these log entries on the three master(-eligible) nodes. BTW, we query only those for statistics.
There are many nodes involved (29 so far), each of them a mixed data/client node:
egrep 'exception caught on transport layer' all_masters.log | egrep -o 'remoteAddress.*}' | sort | uniq -c
   2 remoteAddress=10.6.145.193/10.6.145.193:9300}
   1 remoteAddress=10.6.145.195/10.6.145.195:9300}
   2 remoteAddress=10.6.145.197/10.6.145.197:9300}
   1 remoteAddress=10.6.145.199/10.6.145.199:9300}
   4 remoteAddress=10.6.145.200/10.6.145.200:9300}
   1 remoteAddress=10.6.145.201/10.6.145.201:9300}
   1 remoteAddress=10.6.145.202/10.6.145.202:9300}
   1 remoteAddress=10.6.145.203/10.6.145.203:9300}
   1 remoteAddress=10.6.145.204/10.6.145.204:9300}
   1 remoteAddress=10.6.145.206/10.6.145.206:9300}
   1 remoteAddress=10.6.145.210/10.6.145.210:9300}
   1 remoteAddress=10.6.145.211/10.6.145.211:9300}
   1 remoteAddress=10.6.145.223/10.6.145.223:9300}
   2 remoteAddress=10.6.145.225/10.6.145.225:9300}
   1 remoteAddress=10.6.145.226/10.6.145.226:9300}
   3 remoteAddress=10.6.145.227/10.6.145.227:9300}
   3 remoteAddress=10.6.145.228/10.6.145.228:9300}
   1 remoteAddress=10.6.145.232/10.6.145.232:9300}
   2 remoteAddress=10.6.145.233/10.6.145.233:9300}
   1 remoteAddress=10.6.145.234/10.6.145.234:9300}
   3 remoteAddress=10.6.145.235/10.6.145.235:9300}
   3 remoteAddress=10.6.145.236/10.6.145.236:9300}
   1 remoteAddress=10.6.145.237/10.6.145.237:9300}
   1 remoteAddress=10.6.145.239/10.6.145.239:9300}
   3 remoteAddress=10.6.145.240/10.6.145.240:9300}
   2 remoteAddress=10.6.145.241/10.6.145.241:9300}
   2 remoteAddress=10.6.145.242/10.6.145.242:9300}
   2 remoteAddress=10.6.145.243/10.6.145.243:9300}
   2 remoteAddress=10.6.145.244/10.6.145.244:9300}
As you can see, it happens more often with some of them, but I can't single out one as the culprit.
It also happens on master-eligible but non-master nodes; the above contains their logs too.

Today the log dates are:
[2018-02-17T00:15:46,519][WARN
[2018-02-17T00:21:08,014][WARN
[2018-02-17T02:16:04,699][WARN
[2018-02-17T02:22:18,086][WARN
[2018-02-17T02:25:28,047][WARN
[2018-02-17T02:31:26,304][WARN
[2018-02-17T02:35:43,636][WARN
[2018-02-17T02:50:40,803][WARN
[2018-02-17T02:55:27,373][WARN
[2018-02-17T03:30:54,778][WARN
[2018-02-17T03:55:15,757][WARN
[2018-02-17T05:40:33,378][WARN
[2018-02-17T05:40:48,091][WARN
[2018-02-17T06:10:37,117][WARN
[2018-02-17T06:10:49,796][WARN
[2018-02-17T06:56:14,371][WARN
[2018-02-17T07:11:15,700][WARN
[2018-02-17T07:26:03,577][WARN
[2018-02-17T07:30:39,050][WARN
[2018-02-17T07:36:26,556][WARN
[2018-02-17T08:06:04,656][WARN
[2018-02-17T08:45:43,094][WARN
[2018-02-17T08:55:47,759][WARN
[2018-02-17T09:00:32,387][WARN
[2018-02-17T09:45:45,487][WARN
[2018-02-17T10:15:53,099][WARN
[2018-02-17T10:20:38,481][WARN
[2018-02-17T10:30:29,252][WARN
[2018-02-17T11:00:27,803][WARN
[2018-02-17T11:10:28,640][WARN
[2018-02-17T11:50:43,091][WARN
[2018-02-17T12:00:48,460][WARN
[2018-02-17T12:35:25,549][WARN
[2018-02-17T12:35:32,719][WARN
[2018-02-17T12:50:30,927][WARN
[2018-02-17T13:00:51,557][WARN
[2018-02-17T13:35:25,834][WARN
[2018-02-17T14:15:54,408][WARN
[2018-02-17T14:25:36,535][WARN
[2018-02-17T14:25:46,726][WARN
[2018-02-17T15:10:28,747][WARN
[2018-02-17T15:15:29,193][WARN
[2018-02-17T15:50:16,142][WARN
[2018-02-17T16:00:48,446][WARN
[2018-02-17T16:00:57,826][WARN
[2018-02-17T16:10:23,455][WARN
[2018-02-17T16:20:25,079][WARN
[2018-02-17T16:35:34,404][WARN
[2018-02-17T16:40:28,779][WARN
[2018-02-17T17:05:34,549][WARN
Most of them fall close to 0 and 5 minutes past the hour, which correlates with our statistics collection cron job.
It doesn't happen on every _nodes/stats query. I've fired some queries by hand, but nothing happened. I will try running the stats collector more often, along with a tcpdump.
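Roughly like this (the interface name and capture path are placeholders for our FreeBSD boxes):

# capture inter-node transport traffic while hammering the stats endpoint
tcpdump -i em0 -s 0 -w /tmp/transport.pcap 'tcp port 9300' &
while true; do curl -s 'http://localhost:9200/_nodes/stats' > /dev/null; sleep 1; done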

@bra-fsn
Author

bra-fsn commented Feb 17, 2018

I managed to capture an error; I've sent the dump and log to you by email.

@ywelsch
Contributor

ywelsch commented Feb 18, 2018

Thanks for the packet capture. I've had a look and I think I know what's going on.

The AdaptiveSelectionStats object (introduced in 6.1) serializes the clientOutgoingConnections map that's concurrently updated in SearchTransportService. Serializing the map consists of first writing the size of the map and then serializing the entries. If the number of entries changes while the map is being serialized, the size and number of entries go out of sync. The deserialization routine expects those to be in sync though. I've opened #28718 as a fix.
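To illustrate the race with a minimal standalone sketch (this is not the actual Elasticsearch StreamOutput/StreamInput code; the class and method names below are made up for the example):

import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class MapSerializationRace {
    // Writer side: the size is captured first, then the entries are walked.
    // If another thread inserts or removes an entry in between, the count on
    // the wire no longer matches the entries that follow, and the reader
    // starts interpreting unrelated bytes as string data -- hence the
    // "Invalid string; unexpected character" failures above.
    static void writeMap(DataOutputStream out, Map<String, Long> map) throws IOException {
        out.writeInt(map.size());                          // snapshot of the size
        for (Map.Entry<String, Long> e : map.entrySet()) { // may see more/fewer entries
            out.writeUTF(e.getKey());
            out.writeLong(e.getValue());
        }
    }

    // The shape of the fix: serialize a point-in-time copy, so the size and
    // the entries are always consistent (see #28718 for the real change).
    static void writeMapSafely(DataOutputStream out, ConcurrentHashMap<String, Long> map) throws IOException {
        Map<String, Long> copy = new HashMap<>(map); // consistent snapshot
        writeMap(out, copy);
    }
}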

@ywelsch ywelsch added >bug :Data Management/Stats Statistics tracking and retrieval APIs labels Feb 18, 2018
@bra-fsn
Author

bra-fsn commented Feb 19, 2018

The cluster has been fine since yesterday with your latest patch:
ef03a0e
Thanks for the quick fix!

@bra-fsn bra-fsn closed this as completed Feb 19, 2018
ywelsch added a commit that referenced this issue Feb 19, 2018
The AdaptiveSelectionStats object serializes the clientOutgoingConnections map that's concurrently updated in SearchTransportService. Serializing the map consists of first writing the size of the map and then serializing the entries. If the number of entries changes while the map is being serialized, the size and number of entries go out of sync. The deserialization routine expects those to be in sync though.

Closes #28713
ywelsch added two more commits that referenced this issue Feb 19, 2018, with the same commit message.
@ywelsch
Contributor

ywelsch commented Feb 19, 2018

Thanks for the detailed report and the packet dump. For anyone else running into this bug:
This can manifest when using the node stats call (GET /_nodes/stats), which queries the adaptive replica selection stats by default, even if the feature is not used. The only workaround is to explicitly specify all the desired metrics (os, jvm, thread_pool, ...) on the GET /_nodes/stats request, which implicitly excludes the collection of the adaptive replica selection stats.
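For example (assuming a default local setup; the metric list shown is illustrative — request whichever metrics you actually need, as long as adaptive_selection is not pulled in by the default full set):

curl -s 'http://localhost:9200/_nodes/stats/os,jvm,thread_pool,fs,indices,transport,http,process'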
