New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STORM-2194: Report error and die, not report error or die #1767

Open
wants to merge 1 commit into
from

Conversation

Projects
None yet
3 participants
@chawco
Contributor

chawco commented Nov 9, 2016

We should still kill our executor after encountering an unhandled exception. This change logs what happens as before, but also ensures we always call suicide-fn to ensure our worker exits.

@revans2

This comment has been minimized.

Show comment
Hide comment
@revans2

revans2 Nov 21, 2016

Contributor

-1.

InterruptedException we often use to indicate that the process is shutting down, and we want to not blow up when we see them. Perhaps what we want to do is to better document it.

Contributor

revans2 commented Nov 21, 2016

-1.

InterruptedException we often use to indicate that the process is shutting down, and we want to not blow up when we see them. Perhaps what we want to do is to better document it.

@sathyafmt

This comment has been minimized.

Show comment
Hide comment
@sathyafmt

sathyafmt Dec 1, 2016

@revans2 - His change in storm-1.0.2 is in the report-error-and-die function, so it should shutdown, correct ?

@revans2 - His change in storm-1.0.2 is in the report-error-and-die function, so it should shutdown, correct ?

@chawco

This comment has been minimized.

Show comment
Hide comment
@chawco

chawco Dec 1, 2016

Contributor

So, there's some ambiguity in place here that would need to get resolved. Basically, there's two sources for InterruptedException/InterruptedIOException -- one from Storm itself, and one from the user's code (See: STORM-2194 If we want to have different behaviour from these two sources we do need some solution to disambiguate the uses.

Basically, the problem is this: If anything in the user's thread raises these exceptions, the executor thread will terminate but the worker will not. This leaves the entire topology in a zombified state. I find it difficult to see that this behaviour is "as intended".

@revans2 can you suggest a way we can disambiguate between Storm initiated and user initiated exceptions here? I'm having a real tough time thinking how we could accomplish that. Alternatively, can you propose an alternate implementation for signalling this shutdown? I'm happy to take on that work to make things shutdown in a more appropriate manner, as long as we get to fix this zombie topology behaviour.

Contributor

chawco commented Dec 1, 2016

So, there's some ambiguity in place here that would need to get resolved. Basically, there's two sources for InterruptedException/InterruptedIOException -- one from Storm itself, and one from the user's code (See: STORM-2194 If we want to have different behaviour from these two sources we do need some solution to disambiguate the uses.

Basically, the problem is this: If anything in the user's thread raises these exceptions, the executor thread will terminate but the worker will not. This leaves the entire topology in a zombified state. I find it difficult to see that this behaviour is "as intended".

@revans2 can you suggest a way we can disambiguate between Storm initiated and user initiated exceptions here? I'm having a real tough time thinking how we could accomplish that. Alternatively, can you propose an alternate implementation for signalling this shutdown? I'm happy to take on that work to make things shutdown in a more appropriate manner, as long as we get to fix this zombie topology behaviour.

@revans2

This comment has been minimized.

Show comment
Hide comment
@revans2

revans2 Dec 2, 2016

Contributor

@chawco

Okay so I understand the issue better now. SocketTimeoutException is a subclass of InterruptedIOException.

https://docs.oracle.com/javase/7/docs/api/java/net/SocketTimeoutException.html

I could argue that it is a mistake on the part of java and that it is wrong, but that is already set in stone so we have to deal with it.

I see two options.

  1. We can treat a SocketTimeoutException differently from other InterruptedIOExceptions,
  2. or we can just treat all InterruptedIOExceptions as fatal.

We started ignoring InterruptedIOExceptions because we would occasionally run into them in the supervisor or nimbus local cluster tests and that would fail everything. Having proper behavior is more important than having super stable unit tests, but if we can have both (option 1) I think that would be best.

Contributor

revans2 commented Dec 2, 2016

@chawco

Okay so I understand the issue better now. SocketTimeoutException is a subclass of InterruptedIOException.

https://docs.oracle.com/javase/7/docs/api/java/net/SocketTimeoutException.html

I could argue that it is a mistake on the part of java and that it is wrong, but that is already set in stone so we have to deal with it.

I see two options.

  1. We can treat a SocketTimeoutException differently from other InterruptedIOExceptions,
  2. or we can just treat all InterruptedIOExceptions as fatal.

We started ignoring InterruptedIOExceptions because we would occasionally run into them in the supervisor or nimbus local cluster tests and that would fail everything. Having proper behavior is more important than having super stable unit tests, but if we can have both (option 1) I think that would be best.

@revans2

This comment has been minimized.

Show comment
Hide comment
@revans2

revans2 Dec 2, 2016

Contributor

We should be able to fix this with code like.

(if (or
       (exception-cause? InterruptedException error)
       (and
           (exception-cause? java.io.InterruptedIOException error)
           (not (exception-cause? java.net.SocketTimeoutException))))
Contributor

revans2 commented Dec 2, 2016

We should be able to fix this with code like.

(if (or
       (exception-cause? InterruptedException error)
       (and
           (exception-cause? java.io.InterruptedIOException error)
           (not (exception-cause? java.net.SocketTimeoutException))))
@sathyafmt

This comment has been minimized.

Show comment
Hide comment
@sathyafmt

sathyafmt Dec 2, 2016

Thanks @revans2.

But this wont fix the other problem I am seeing with java.net.BindException (address already in use). See my comment in STORM-2194:
1. when storm workers start & they are not able to bind to 56700 (the rmi port), they hang around and do not die. This is easy to reproduce, I started a nc -l 56700 & started the topology. With your patch, it dies & the supervisor restarts them back again.
2016-12-01 04:24:41.721 STDERR [INFO] Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 56700; nested exception is:
2016-12-01 04:24:41.722 STDERR [INFO] java.net.BindException: Address already in use

Thanks @revans2.

But this wont fix the other problem I am seeing with java.net.BindException (address already in use). See my comment in STORM-2194:
1. when storm workers start & they are not able to bind to 56700 (the rmi port), they hang around and do not die. This is easy to reproduce, I started a nc -l 56700 & started the topology. With your patch, it dies & the supervisor restarts them back again.
2016-12-01 04:24:41.721 STDERR [INFO] Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 56700; nested exception is:
2016-12-01 04:24:41.722 STDERR [INFO] java.net.BindException: Address already in use

@revans2

This comment has been minimized.

Show comment
Hide comment
@revans2

revans2 Jan 6, 2017

Contributor

@sathyafmt sorry I have taken so long to respond December was a really crazy month for me. From STORM-2194 I see that the SocketTimeoutException goes through the code being changed. The RMI code does not go through that path at all.

2016-12-01 04:24:41.721 STDERR [INFO] Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 56700; nested exception is:
2016-12-01 04:24:41.722 STDERR [INFO] java.net.BindException: Address already in use

If it did then we would have exited because BindException and ExportException are neither InterruptedIOException nor InterruptedException.

So this patch, nor the one I proposed would have any impact on the RMI case at all. Something else is catching the ExportException and printing to STDERR the error message above.

Contributor

revans2 commented Jan 6, 2017

@sathyafmt sorry I have taken so long to respond December was a really crazy month for me. From STORM-2194 I see that the SocketTimeoutException goes through the code being changed. The RMI code does not go through that path at all.

2016-12-01 04:24:41.721 STDERR [INFO] Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 56700; nested exception is:
2016-12-01 04:24:41.722 STDERR [INFO] java.net.BindException: Address already in use

If it did then we would have exited because BindException and ExportException are neither InterruptedIOException nor InterruptedException.

So this patch, nor the one I proposed would have any impact on the RMI case at all. Something else is catching the ExportException and printing to STDERR the error message above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment