On Solaris 10 (Illumos), setting TCP_NODELAY on a closed socket causes elasticsearch to be unresponsive #7115

Closed
f3nry opened this Issue Jul 31, 2014 · 10 comments

f3nry commented Jul 31, 2014

We're on Elasticsearch 1.1.1 running on Illumos (a Solaris 10 derivative, on Joyent).

We ran into an issue today where elasticsearch became completely unresponsive after the following exception:

[2014-07-31 12:30:18,081][WARN ][monitor.jvm              ] [HOSTNAME] [gc][young][3604571][140866] duration [1.9s], collections [1]/[2.2s], total [1.9s]/[1.2h], memory [22.7gb]->[21.4gb]/[29.1gb], all_pools {[young] [1.3gb]->[29.2mb]/[1.4gb]}{[survivor] [70mb]->[55.3mb]/[191.3mb]}{[old] [21.3gb]->[21.3gb]/[27.4gb]}
[2014-07-31 12:30:27,075][WARN ][monitor.jvm              ] [HOSTNAME] [gc][young][3604579][140869] duration [1.2s], collections [1]/[1.9s], total [1.2s]/[1.2h], memory [22.3gb]->[21.2gb]/[29.1gb], all_pools {[young] [1.1gb]->[29.8mb]/[1.4gb]}{[survivor] [52.9mb]->[46.8mb]/[191.3mb]}{[old] [21.2gb]->[21.2gb]/[27.4gb]}
[2014-07-31 12:30:35,954][WARN ][http.netty               ] [HOSTNAME] Caught exception while handling client http traffic, closing connection [id: 0x810b66dd, /IPSOURCE:48650 => /IPDEST:9200]
org.elasticsearch.common.netty.channel.ChannelException: java.net.SocketException: Invalid argument
        at org.elasticsearch.common.netty.channel.socket.DefaultSocketChannelConfig.setTcpNoDelay(DefaultSocketChannelConfig.java:178)
        at org.elasticsearch.common.netty.channel.socket.DefaultSocketChannelConfig.setOption(DefaultSocketChannelConfig.java:54)
        at org.elasticsearch.common.netty.channel.socket.nio.DefaultNioSocketChannelConfig.setOption(DefaultNioSocketChannelConfig.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelConfig.setOptions(DefaultChannelConfig.java:36)
        at org.elasticsearch.common.netty.channel.socket.nio.DefaultNioSocketChannelConfig.setOptions(DefaultNioSocketChannelConfig.java:54)
        at org.elasticsearch.common.netty.bootstrap.ServerBootstrap$Binder.childChannelOpen(ServerBootstrap.java:399)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:77)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireChildChannelStateChanged(Channels.java:541)
        at org.elasticsearch.common.netty.channel.Channels.fireChannelOpen(Channels.java:167)
        at org.elasticsearch.common.netty.channel.socket.nio.NioAcceptedSocketChannel.<init>(NioAcceptedSocketChannel.java:42)
        at org.elasticsearch.common.netty.channel.socket.nio.NioServerBoss.registerAcceptedChannel(NioServerBoss.java:137)
        at org.elasticsearch.common.netty.channel.socket.nio.NioServerBoss.process(NioServerBoss.java:104)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
        at org.elasticsearch.common.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.SocketException: Invalid argument
        at sun.nio.ch.Net.setIntOption0(Native Method)
        at sun.nio.ch.Net.setSocketOption(Net.java:373)
        at sun.nio.ch.SocketChannelImpl.setOption(SocketChannelImpl.java:189)
        at sun.nio.ch.SocketAdaptor.setBooleanOption(SocketAdaptor.java:295)
        at sun.nio.ch.SocketAdaptor.setTcpNoDelay(SocketAdaptor.java:330)
        at org.elasticsearch.common.netty.channel.socket.DefaultSocketChannelConfig.setTcpNoDelay(DefaultSocketChannelConfig.java:176)
        ... 20 more

On Solaris, setsockopt behaves differently than on other platforms: when the socket has already been closed, it returns EINVAL, which Java surfaces as a SocketException ("Invalid argument"). Apparently this happens when the client closes the connection before the server has finished its accept. Elasticsearch appears to have been doing a garbage collection around that time.

Here are a few references to this bug occurring in other projects:

http://bugs.java.com/view_bug.do?bug_id=6378870
https://java.net/jira/browse/GLASSFISH-5342
https://jira.atlassian.com/browse/STASH-3624

It also appears that in Netty 4.0+ this might have been fixed by: netty/netty@39357f3#diff-dbfa6a222217d4fc2c12d20ee3496eb3R50

Unfortunately, this is a bit difficult to reproduce, and it only happens rarely. I'd imagine it can be reproduced by running Elasticsearch on Solaris 10 and finding a way to stall the server long enough for the client to close the connection before the server has set the socket options. Elasticsearch should then stall and stop responding to any requests (which is the behavior we saw).
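A minimal sketch of that reproduction idea, using plain java.net sockets rather than Elasticsearch or Netty. The class name NoDelayRaceRepro, the method attempt, and the timeout values are all made up for illustration, and the sleep is only a stand-in for a long GC pause. Whether the setTcpNoDelay call fails depends on the platform, so the outcome is reported rather than assumed.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketException;
import java.net.SocketTimeoutException;

// Hypothetical helper, not part of Elasticsearch or Netty.
final class NoDelayRaceRepro {
    private NoDelayRaceRepro() {}

    /**
     * Lets a client connect and close before the server accepts, then tries
     * to set TCP_NODELAY on the accepted socket.
     * Returns "set" on success, or "failed: ..." with the exception otherwise.
     */
    static String attempt(long stallMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            server.setSoTimeout(2000); // never block forever in accept()

            // The kernel completes the handshake and queues the connection
            // even though accept() has not been called yet.
            Socket client = new Socket();
            client.connect(new InetSocketAddress(loopback, server.getLocalPort()), 2000);
            client.close(); // the client gives up before the server accepts

            try {
                Thread.sleep(stallMillis); // stand-in for a long GC pause
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }

            try (Socket accepted = server.accept()) {
                accepted.setTcpNoDelay(true);
                return "set";
            } catch (SocketException | SocketTimeoutException e) {
                return "failed: " + e;
            }
        }
    }
}
```

Calling NoDelayRaceRepro.attempt(2000) on the affected Solaris/Illumos host and on another platform would be one way to compare behavior; this sketch has not been verified to trigger the failure.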

Thanks,
Paul

@f3nry f3nry changed the title from "On Solaris 10 (Illumos), setting TCP_NODELAY could cause elasticsearch to be unresponsive" to "On Solaris 10 (Illumos), setting TCP_NODELAY on a closed socket causes elasticsearch to be unresponsive" Jul 31, 2014

Member

kimchy commented Jul 31, 2014

It seems that in Netty it's left unset by default only on Android, and is still being set on Solaris. I would be more than happy to create a change to disable it on Solaris. Others, thoughts?

Author

f3nry commented Jul 31, 2014

@kimchy Correct, though the call is now wrapped in a try/catch block and any error it throws is ignored. TCP_NODELAY should still be set on Solaris; it's only the exception thrown when the socket is already closed that should be ignored.
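To illustrate the "set it, but ignore the failure" approach described here, a minimal sketch with a plain java.net.Socket. This is not Netty's actual code; the class name TcpNoDelayGuard and the method trySetTcpNoDelay are hypothetical.

```java
import java.net.Socket;
import java.net.SocketException;

// Hypothetical helper, not Netty's or Elasticsearch's implementation.
final class TcpNoDelayGuard {
    private TcpNoDelayGuard() {}

    /**
     * Tries to apply TCP_NODELAY. Returns true if the option was applied and
     * false if the socket rejected it, instead of propagating the exception.
     */
    static boolean trySetTcpNoDelay(Socket socket, boolean enabled) {
        try {
            socket.setTcpNoDelay(enabled);
            return true;
        } catch (SocketException e) {
            // For example "Invalid argument" on Solaris when the peer has
            // already closed the connection, or "Socket is closed" when the
            // socket was closed locally. The connection is unusable anyway,
            // so the failure is swallowed here.
            return false;
        }
    }
}
```

Returning a boolean rather than silently discarding the result leaves the caller free to log the failure or close the channel.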

Member

kimchy commented Jul 31, 2014

Yes, the silent ignore in the exception... I was just wondering why Netty didn't disable it on Solaris by default as well. Based on your input, it seems like it should. I am reaching out to some Solaris experts on our end to see what they think, just to be doubly sure we should make this change. Thanks for bringing it up!

Author

f3nry commented Jul 31, 2014

Okay, awesome! Thanks so much!

Member

kimchy commented Jul 31, 2014

@letuboy btw, which Java version are you running?

Member

kimchy commented Jul 31, 2014

And another question: if you set it to false, does it still happen (still gathering info, can probably find out on my own as well)? Is it just the mere fact of calling setTcpNoDelay? If so, then we need to not set this setting at all on Solaris, or at the very least provide another setting to not set it (or another option, call it "default", to leave it as is).

Author

f3nry commented Jul 31, 2014

We're using OpenJDK 1.7.

openjdk version "1.7.0-internal"
OpenJDK Runtime Environment (build 1.7.0-internal-pkgsrc_2014_05_16_23_21-b00)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)

It appears that the mere fact of calling setTcpNoDelay causes this. It's also very rare, but it has happened a few times.

Member

kimchy commented Aug 2, 2014

@letuboy hard to tell exactly which Java version it actually is... internal?

kimchy added a commit to kimchy/elasticsearch that referenced this issue Aug 2, 2014

Support "default" for tcpNoDelay and tcpKeepAlive
Allow to set the value default to network.tcp.no_delay and network.tcp.keep_alive so they won't be set at all, since on solaris, setting tcpNoDelay can actually cause failure
relates to elastic#7115

kimchy added a commit that referenced this issue Aug 2, 2014

Support "default" for tcpNoDelay and tcpKeepAlive
Allow to set the value default to network.tcp.no_delay and network.tcp.keep_alive so they won't be set at all, since on solaris, setting tcpNoDelay can actually cause failure
relates to #7115

Member

kimchy commented Aug 2, 2014

I pushed #7136 to master and 1.x (upcoming 1.4) to allow setting "default" as the value, in which case the option will not be set at all. It's not a good out-of-the-box solution, but at least now users will have the option to configure ES not to set it.
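A rough sketch of the idea behind the "default" value: treat the setting as tri-state and skip the socket call entirely when it is "default". This is not the code from #7136; the class TriStateSocketOption and its methods are hypothetical, and only the setting name network.tcp.no_delay comes from the commit message.

```java
import java.net.Socket;
import java.net.SocketException;
import java.util.Objects;
import java.util.Optional;

// Hypothetical illustration of a tri-state setting, not the code from #7136.
final class TriStateSocketOption {
    private TriStateSocketOption() {}

    /** Parses "true", "false" or "default"; "default" maps to an empty Optional. */
    static Optional<Boolean> parse(String raw) {
        String value = Objects.requireNonNull(raw, "raw").trim();
        if (value.equalsIgnoreCase("default")) {
            return Optional.empty();
        }
        if (value.equalsIgnoreCase("true")) {
            return Optional.of(Boolean.TRUE);
        }
        if (value.equalsIgnoreCase("false")) {
            return Optional.of(Boolean.FALSE);
        }
        throw new IllegalArgumentException("expected true, false or default, got: " + raw);
    }

    /**
     * Applies a network.tcp.no_delay style value to the socket.
     * Returns false when the value is "default" and the socket is left untouched.
     */
    static boolean applyNoDelay(Socket socket, String raw) throws SocketException {
        Optional<Boolean> noDelay = parse(raw);
        if (!noDelay.isPresent()) {
            return false; // "default": never call setTcpNoDelay at all
        }
        socket.setTcpNoDelay(noDelay.get());
        return true;
    }
}
```

With "default", the platform's own TCP_NODELAY behavior applies and the failing setsockopt call is never made, which is the escape hatch described above.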

Author

f3nry commented Aug 2, 2014

OpenJDK 1.7 corresponds to Java 7, as far as I'm aware. Thanks so much! We'll look out for the 1.4 release and update the setting when that happens. This issue is rare, so it shouldn't be too much of a pain until then. I'll close this ticket.

@sax @indirect

@f3nry f3nry closed this Aug 2, 2014

kimchy added a commit that referenced this issue Sep 8, 2014

Support "default" for tcpNoDelay and tcpKeepAlive
Allow to set the value default to network.tcp.no_delay and network.tcp.keep_alive so they won't be set at all, since on solaris, setting tcpNoDelay can actually cause failure
relates to #7115