
fixing SocketException handling in MulticastZenPing receiver #2783

Closed · wants to merge 3 commits

Conversation

jprante (Contributor) commented Mar 15, 2013

With this patch, the Maven Surefire test run succeeds on Mac OS X 10.8 (1164 tests successful).
Without the patch, it hangs in an infinite loop, logging a SocketException ("Socket closed") over and over again.

kimchy (Member) commented Mar 17, 2013

What exception do you actually get? A SocketException doesn't necessarily mean that the receiver is no longer running; it might be transient.

jprante (Contributor, Author) commented Mar 17, 2013

The exception is:

java.net.SocketException: Socket closed
at java.net.PlainDatagramSocketImpl.receive0(Native Method)
at java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:145)
at java.net.DatagramSocket.receive(DatagramSocket.java:786)
at org.elasticsearch.discovery.zen.ping.multicast.MulticastZenPing$Receiver.run(MulticastZenPing.java:374)
at java.lang.Thread.run(Thread.java:722)

See https://gist.github.com/jprante/5183980

The test is integration.recovery.FullRollingRestartTests#testFullRollingRestart

kimchy (Member) commented Mar 17, 2013

That's strange, because we only close the socket after we set the running flag to false, so this should not be logged (it's only logged if it's still running). Can you maybe check why it's in that state? (I'm running on OS X 10.8 and it seems fine.)

jprante (Contributor, Author) commented Mar 17, 2013

I see this in the JavaDoc of java.net.DatagramSocket#close():

"Any thread currently blocked in receive(java.net.DatagramPacket) upon this socket will throw a SocketException."

http://docs.oracle.com/javase/7/docs/api/java/net/DatagramSocket.html
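
For illustration, a minimal standalone demo of that documented behavior (hypothetical code, not part of this PR): closing the socket from a second thread makes the blocked receive() throw java.net.SocketException: Socket closed.

import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.SocketException;

public class CloseUnblocksReceive {
    public static void main(String[] args) throws Exception {
        final DatagramSocket socket = new DatagramSocket(0); // bind to any free port
        Thread receiver = new Thread(() -> {
            byte[] buf = new byte[64];
            try {
                socket.receive(new DatagramPacket(buf, buf.length)); // blocks here
            } catch (SocketException e) {
                // prints "Socket closed" once close() is called below
                System.out.println("receive() threw: " + e.getMessage());
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        receiver.start();
        Thread.sleep(500); // let the receiver thread block in receive()
        socket.close();    // unblocks the receiver with a SocketException
        receiver.join();
    }
}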

kimchy (Member) commented Mar 17, 2013

Sure, but we only actually close the socket after we set running to false (check the doStop method), so I wonder how the socket ends up being closed without the running variable being set to false. Since I can't recreate it here, I need your help :) to check the flow in which this failure happens...
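
A simplified sketch of that flow (the names running and doStop follow the discussion; the class structure is assumed, not the actual MulticastZenPing code). Because the flag flips before close(), a SocketException caused by shutdown should always land in the !running branch:

import java.io.IOException;
import java.net.DatagramPacket;
import java.net.MulticastSocket;

public class ReceiverSketch implements Runnable {

    private volatile boolean running = true;
    private final MulticastSocket socket;

    public ReceiverSketch(MulticastSocket socket) {
        this.socket = socket;
    }

    @Override
    public void run() {
        byte[] buf = new byte[2048];
        while (running) {
            try {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet); // blocks until a packet arrives
                // ... handle the ping message ...
            } catch (IOException e) {
                if (running) {
                    // unexpected failure while we are still supposed to run
                    System.err.println("failed to receive packet: " + e);
                }
                // if !running, the exception is the expected effect of doStop()
            }
        }
    }

    public void doStop() {
        running = false; // 1. mark as stopped first
        socket.close();  // 2. then unblock the thread stuck in receive()
    }
}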

jprante (Contributor, Author) commented Mar 18, 2013

I'm checking against a series of JDK versions now. It seems odd that a volatile variable cannot be written to.

kimchy (Member) commented Mar 18, 2013

What do you mean, the volatile variable can't be written to? Which version does it fail on?

jprante (Contributor, Author) commented Mar 18, 2013

I tried to replace the volatile running variable with a check of Thread.currentThread().isInterrupted(), but it did not change much. With all installed JDKs (6/7/8), I get the SocketException "Socket closed". Only some runs with JDK 6 do not throw the SocketException, and I still don't know why. I have to investigate further.

jprante (Contributor, Author) commented Mar 18, 2013

I reworked the patch; this one checks for the "Socket closed" SocketException and the thread interrupt state. Still, I would prefer to always abort the Receiver thread if there is a SocketException, to avoid a loop.
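
A minimal sketch of that kind of guard (assumed structure, not the actual patch): bail out of the receive loop on a "Socket closed" SocketException or a pending interrupt, instead of logging the same exception forever.

import java.io.IOException;
import java.net.DatagramPacket;
import java.net.MulticastSocket;
import java.net.SocketException;

public class GuardedReceiveLoop {

    static void receiveLoop(MulticastSocket socket) {
        byte[] buf = new byte[2048];
        while (!Thread.currentThread().isInterrupted()) {
            try {
                socket.receive(new DatagramPacket(buf, buf.length));
                // ... handle the ping message ...
            } catch (SocketException e) {
                if ("Socket closed".equals(e.getMessage()) || Thread.currentThread().isInterrupted()) {
                    return; // socket is gone or we are shutting down: abort the receiver
                }
                System.err.println("transient socket failure: " + e);
            } catch (IOException e) {
                System.err.println("failed to receive packet: " + e);
            }
        }
    }
}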

spinscale (Contributor) commented:

Hey Joerg,

I am running into this issue as well, though not when running the standalone tests on elasticsearch master, but when running the tests of my FST suggester plugin against the current elasticsearch release or master.

However, it seems this issue occurs in my plugin due to a resource leak. Are you using elasticsearch to test some plugin, or just running the tests on master?

jprante (Contributor, Author) commented Mar 18, 2013

Just running on master, no plugins.

Can you explain what the resource leak was?

kimchy (Member) commented Mar 18, 2013

@jprante I don't like fixing things without understanding why they happen, and as I mentioned, a SocketException can happen not just because the socket is closed. I would still like to understand why the current logic fails before we try to fix something...

spinscale (Contributor) commented:

This is really weird... I can reproduce this on master, but only in about 20% of my test runs.

I used this for a bit more automation:

for i in 1 2 3 4 5 6 7 8 9 10 ; do
  mvn -Dtest=FullRollingRestartTests test 2>&1 | tee /tmp/es-log-$i.log | grep 'Tests run' | grep elapsed
done

I tried making the MulticastSocket volatile, but that does not change anything. Before going on, I'll test on a different operating system, since I guess everyone involved uses a pretty up-to-date Mac.

jprante (Contributor, Author) commented Mar 18, 2013

@kimchy Agreed. I have to investigate with extra test code and variations of JVM parameters. Maybe it is specific to the Mac OS X JVM.

jprante (Contributor, Author) commented Mar 18, 2013

@spinscale I will also test on Solaris, Linux, and Windows.

spinscale (Contributor) commented:

Ubuntu with the .17 JDK looks stable.

jprante (Contributor, Author) commented Mar 18, 2013

From my tests I am now quite sure that running the test exceeds the default Mac OS X 10.8 per-process limit of open files. It always breaks abruptly when the 5th ES node is started up in FullRollingRestartTests. After increasing the process file limit with ulimit -S -n 2048 (via a root bash shell), the test no longer breaks, as far as I can judge from my runs.

Interestingly, there is no "too many open files" error thrown. It seems to get hidden somewhere in the internals of the Mac OS X JVM. The multicast socket is suddenly marked as closed, and this new closed state is immediately exposed to the Java application as a SocketException.

I think I have to dig deeper to verify this (I'm more of a Linux kernel guy). If my assumption holds, I will try to create a patch for the JVM folks, because it looks like a Mac OS X JVM bug: a "too many open files" error should be surfaced if that is the underlying reason.

On the other hand, the ES multicast socket receiver thread should be protected against such weird exceptions and shut down more gracefully instead of entering an infinite loop.

kimchy added a commit that referenced this pull request Mar 19, 2013
also, throttle on socket failures, so it won't spin out of control...
relates to #2783
kimchy (Member) commented Mar 19, 2013

It is strange... on my Mac, I don't run out of file descriptors when running the tests.

I have pushed two changes: I made multicastSocket volatile (it should have been), and if the socket is closed, we try to restart the multicast discovery; we also throttle on unknown socket failures so the loop won't spin.
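
A hypothetical sketch of those fixes combined (assumed names and structure, not the actual commit): the socket field is volatile, a closed socket triggers a restart attempt, and unknown failures are throttled with a short sleep.

import java.io.IOException;
import java.net.DatagramPacket;
import java.net.MulticastSocket;

public class ThrottledReceiver {

    private volatile boolean running = true;
    private volatile MulticastSocket socket; // volatile now, as in the fix

    public ThrottledReceiver(MulticastSocket socket) {
        this.socket = socket;
    }

    void receiveLoop() {
        byte[] buf = new byte[2048];
        while (running) {
            try {
                socket.receive(new DatagramPacket(buf, buf.length));
                // ... handle the ping message ...
            } catch (IOException e) {
                if (!running) {
                    return; // normal shutdown
                }
                if (socket.isClosed()) {
                    restartSocket(); // try to bring multicast discovery back up
                } else {
                    try {
                        Thread.sleep(500); // throttle so the loop can't spin out of control
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        }
    }

    private void restartSocket() {
        // hypothetical helper: re-create the socket and re-join the multicast group
    }
}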

jprante (Contributor, Author) commented Mar 19, 2013

The test is passing now, thanks! Output is at https://gist.github.com/jprante/5200157

I think we can close the issue.

Just for info, it's not "too many open files": my machine tells me there are plenty of open files available to the process (in fact the test goes past 1024 open descriptors, but UnixOperatingSystemMXBean tells me the maximum is 10240). In the JVM code for multicast socket receive I found code that throws a SocketException with the message "Socket closed" if a file descriptor is NULL or EBADF. This looks very much like a low-level issue in the OS, and I'm still investigating. I have updated to Mac OS X 10.8.3 but the problem persists.
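
For reference, those descriptor numbers can be read from inside the JVM like this (hypothetical snippet, not from the PR; com.sun.management is JDK-specific, hence the type check):

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

import com.sun.management.UnixOperatingSystemMXBean;

public class FdStats {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unixOs = (UnixOperatingSystemMXBean) os;
            // open vs. maximum file descriptors for this process
            System.out.println("open fds: " + unixOs.getOpenFileDescriptorCount());
            System.out.println("max fds:  " + unixOs.getMaxFileDescriptorCount());
        }
    }
}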

jprante closed this Mar 19, 2013
pdegeus added a commit to pdegeus/elasticsearch that referenced this pull request Mar 20, 2013
# By Shay Banon (43) and others
# Via Shay Banon
* master-upstream: (97 commits)
  if multicast socket closes, try and restart it
  also, throttle on socket failures, so it won't spin out of control... relates to elastic#2783
  multicastSocket should be volatile as well...
  ...