New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fixing SocketException handling in MulticastZenPing receiver #2783
Conversation
What exception do you actually get? SocketException doesn't necessarily means that its no longer running, it might be transient. |
The exception is java.net.SocketException: Socket closed See https://gist.github.com/jprante/5183980 The test is integration.recovery.FullRollingRestartTests#testFullRollingRestart |
thats strange, because we only close the socket after we set the running flag to true, so this should not be logged (its only logged if its running). Can you maybe check why its in that state (running on OSX 10.8 and it seems fine). |
I see this in the JavaDoc of java.net.DatagramSocket: "Any thread currently blocked in receive(java.net.DatagramPacket) upon this socket will throw a SocketException." http://docs.oracle.com/javase/7/docs/api/java/net/DatagramSocket.html |
Sure, but we only actually close the socket after we set running to false (check the |
I'm checking against a series of JDK versions now. Seems odd that a volatile variable can not be written to. |
What do you mean volatile variable can't be written to? which version does it fail on? |
I tried to replace the volatile |
Reworked patch, this one checks for "Socket closed" SocketException and thread interrupt state. Although I would like to prefer to always abort the Receiver thread if there is a SocketException, to avoid a loop. |
Hey Joerg, I am running into this issue as well, but not when running elasticsearch master with standalone tests, but when running the tests of my FST suggester plugin using the current elasticsearch version or master. However it seems, this issue occurs in my plugin due to a resource leak - are you using elasticsearch and testing some plugin with it or just running the tests on master? |
Just running on master, no plugins. Can you explain what was the resource leak? |
@jprante I don't like fixing things without understanding why they happen, and as I mentioned, a SocketException can happen not just because the socket is closed. I still would like to understand why the current logic fails before we try and fix something... |
This is really weird.. I can reproduce this on master... but only in about 20% of my test runs... Used this for a bit more automation:
I tried making the MulticastSocket volatile, but that does not change anything. Before going on, I'll test on a different operating system, as everyone involved uses a pretty up-to-date mac I guess. |
@kimchy Agreed. I have to investigate with extra test code and variations of JVM parameters. Maybe Mac OS X JVM specific. |
@spinscale I will also test on Solaris, Linux, and Windows. |
Ubuntu with the .17 jdk looks stable. |
From my tests I am quite sure now that executing the test exceeds the default Mac OS X 10.8 process limit of open files. It always breaks at a sudden when the 5th ES node is started up in FullRollingRestartTests. Increasing the process file limit with Interestingly, there is no "too many open files" error thrown. It seems to get hidden somewhere in the internals of Mac OS X JVM. Suddenly the multicast socket is marked as closed, and it seems this new close state is immediately exposed to the Java application as a SocketException. I think I have to dig deeper to verify this (I'm more a Linux kernel guy). If my assumption holds I will try to create a patch for the JVM folks because it looks like a Mac OS X JVM bug. A "too many open files" error should be exposed if this is the underlying reason. On the other hand, the ES multicast socket receiver thread should get protected against such weird exceptions, and close down more gracefully instead of entering an infinite loop. |
also, throttle on socket failures, so it won't spin out of control... relates to #2783
It is strange.., on my mac, I don't run out of file descriptors when running the tests... I have pushed two changes, made multicastSocket volatile (it should have been), if the socket is closed, try and restart the multicast discovery, and also throttle on unknown socket failures so it won't spin.... |
Test is passing now, thanks! Output is at https://gist.github.com/jprante/5200157 I think we can close the issue. Just for info, it's not "too many open files", my machine tells me there are a lot of open files available for the process (in fact the test passes 1024 but UnixOperatingSystemMXBean tell me there are 10240 maximum). In the JVM code for multicast socket receive I find code that throws a SocketException with message "Socket closed" if a file descriptor is NULL or EBADF. This looks very much like a low-level issue in the OS and I'm still investigating. I have updated to Mac OS X 10.8.3 but problem persists. |
# By Shay Banon (43) and others # Via Shay Banon * master-upstream: (97 commits) better comment... if multicast socket closes, try and restart it also, throttle on socket failures, so it won't spin out of control... relates to elastic#2783 multicastSocket should be volatile as well... broadcast API to by default ignore missing / illegal shard state this happens for example because we list assigned shards, and they might not have been allocated on the relevant node yet, no need to list those as actual failures in some APIs upgrade to guava 14.0.1 tar.gz distro by mistake include a windows lib fix javadoc Correct filter strategy opt: random_access_random to random_access_always Field Data: optimize long type to use narrowest possible type automatically closes elastic#2795 make ES compile with java 8 - that isAnnotationPresent bug is known, and probably will be fixed in later versions, but it costs us nothing to not use it now - some tests fail, mainly due to consistent ordering expected from Map (within versions) which does not seem to be preserved, need to fix those tests to be agnostic to it use ImmutableList.Builder instead of ArrayList fix logging message to include the index also add the list of current indices Mapping: dynamic flag is explicitly returned even when not set fixes elastic#2789 Fix bug in RateLimiter.SimpleRateLimiter causing numeric overflow in StoreStats improve TODO comment add CamelCase support to Suggester where missing Remove `sort_order` and `sort_mode` in favor of `order` and `mode` Add `sort_oder` and `sortOrder` as valid field names for defining the sort order in a Sort object. Make StupidBackoff the default smoothing model for phrase suggester minor cleanup suggest api - make sure we close the parser - fail when no content is provided in the rest request - reuse the suggest parse element ...
With this patch, Maven surefire test runs successful on Mac OS X 10.8 (1164 tests successful).
Without patch, it hangs in infinite loop, logging a socket exception "socket is closed" over and over again.