STORM-2194: Report error and die, not report error or die #1767
So, there's some ambiguity here that would need to be resolved. Basically, there are two sources of InterruptedException/InterruptedIOException: one from Storm itself, and one from the user's code (see STORM-2194). If we want different behaviour for these two sources, we need some way to disambiguate them.
Basically, the problem is this: If anything in the user's thread raises these exceptions, the executor thread will terminate but the worker will not. This leaves the entire topology in a zombified state. I find it difficult to see that this behaviour is "as intended".
@revans2 can you suggest a way we can disambiguate between Storm-initiated and user-initiated exceptions here? I'm having a really tough time thinking of how we could accomplish that. Alternatively, can you propose a different implementation for signalling this shutdown? I'm happy to take on that work to make things shut down in a more appropriate manner, as long as we get to fix this zombie topology behaviour.
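To make the question concrete, here is a minimal sketch of one possible way to tell the two sources apart. It is not Storm's implementation: the `shutdownRequested` flag, the `Action` enum, and the `classify` helper are all hypothetical names invented for illustration. The assumption is that the framework would set the flag before interrupting its own threads, so any interrupt-like exception seen while the flag is unset must have come from user code.

```java
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public class InterruptOrigin {
    // Hypothetical flag: the framework would set this before interrupting its own threads.
    static final AtomicBoolean shutdownRequested = new AtomicBoolean(false);

    // Hypothetical outcomes for an exception that reaches the executor's error handler.
    enum Action { EXIT_QUIETLY, REPORT_AND_DIE }

    static Action classify(Throwable t) {
        boolean interruptLike = t instanceof InterruptedException
                || t instanceof InterruptedIOException;
        // Only treat the exception as a clean shutdown signal if a shutdown was actually requested.
        if (interruptLike && shutdownRequested.get()) {
            return Action.EXIT_QUIETLY;
        }
        return Action.REPORT_AND_DIE;
    }

    public static void main(String[] args) {
        // A timeout raised by user code while no shutdown is in progress.
        Throwable userTimeout = new SocketTimeoutException("user socket timed out");
        System.out.println(classify(userTimeout));

        // The framework requests shutdown, then interrupts the thread.
        shutdownRequested.set(true);
        System.out.println(classify(new InterruptedException()));
    }
}
```

With this sketch the first call yields `REPORT_AND_DIE` and the second `EXIT_QUIETLY`. A user-side timeout that happens to arrive during shutdown would also be classified as a quiet exit, which seems acceptable since the worker is going down anyway.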
Dec 1, 2016
Okay so I understand the issue better now. SocketTimeoutException is a subclass of InterruptedIOException.
I could argue that it is a mistake on the part of Java and that it is wrong, but that is already set in stone so we have to deal with it.
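A small sketch of why this matters: any check that treats InterruptedIOException as "we were interrupted" also matches a plain socket timeout. The `looksLikeInterrupt` helper below is a hypothetical stand-in for such a check, not the actual Storm code.

```java
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;

public class TimeoutHierarchy {
    // Hypothetical stand-in for an "ignore interrupts" check in an error handler.
    static boolean looksLikeInterrupt(Throwable t) {
        return t instanceof InterruptedException
                || t instanceof InterruptedIOException;
    }

    public static void main(String[] args) {
        // A timeout from user code, e.g. a socket read that took too long.
        Throwable timeout = new SocketTimeoutException("read timed out");

        // SocketTimeoutException extends InterruptedIOException, so this prints "true".
        System.out.println(looksLikeInterrupt(timeout));
    }
}
```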
I see two options.
We started ignoring InterruptedIOExceptions because we would occasionally run into them in the supervisor or nimbus local cluster tests and that would fail everything. Having proper behavior is more important than having super stable unit tests, but if we can have both (option 1) I think that would be best.
But this won't fix the other problem I am seeing with java.net.BindException (address already in use). See my comment in STORM-2194:
@sathyafmt sorry I have taken so long to respond; December was a really crazy month for me. From STORM-2194 I see that the SocketTimeoutException goes through the code being changed. The RMI code does not go through that path at all.
If it did then we would have exited because BindException and ExportException are neither InterruptedIOException nor InterruptedException.
So neither this patch nor the one I proposed would have any impact on the RMI case at all. Something else is catching the ExportException and printing the error message above to STDERR.
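The type relationships behind that reasoning can be checked directly. In the sketch below, `isIgnored` is a hypothetical helper that mirrors an "ignore interrupt-like exceptions" check; it is not the code from the patch.

```java
import java.io.InterruptedIOException;
import java.net.BindException;
import java.rmi.server.ExportException;

public class NonInterruptExceptions {
    // Hypothetical helper: true if the exception would be swallowed as "interrupted".
    static boolean isIgnored(Throwable t) {
        return t instanceof InterruptedException
                || t instanceof InterruptedIOException;
    }

    public static void main(String[] args) {
        Throwable bind = new BindException("Address already in use");
        Throwable export = new ExportException("Port already in use");

        // Both are ordinary IOExceptions, so neither matches the interrupt check.
        System.out.println(isIgnored(bind));    // false
        System.out.println(isIgnored(export));  // false
    }
}
```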