STORM-2194: Report error and die, not report error or die #1767

Closed
wants to merge 1 commit into from

@chawco (Contributor) commented Nov 9, 2016

We should still kill our executor after encountering an unhandled exception. This change logs the error as before, but also always calls suicide-fn so that our worker exits.
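
As a rough illustration of the "report and die" behaviour described above, here is a minimal Java sketch. It is not Storm's implementation: the `reportError` and `suicideFn` parameters are hypothetical stand-ins for the executor's reporting callback and suicide-fn, and the class name is made up for this example.

```java
import java.util.function.Consumer;

final class ReportErrorAndDie {
    /**
     * Builds a handler that reports the error and then always runs suicideFn.
     * Both callbacks are hypothetical stand-ins, not Storm APIs.
     */
    static Consumer<Throwable> make(Consumer<Throwable> reportError, Runnable suicideFn) {
        return error -> {
            try {
                reportError.accept(error);
            } catch (RuntimeException e) {
                // Reporting itself failed; log it and still proceed with shutdown.
                System.err.println("Error while reporting error, proceeding with shutdown: " + e);
            } finally {
                // Runs whether or not reporting succeeded.
                suicideFn.run();
            }
        };
    }

    public static void main(String[] args) {
        Consumer<Throwable> handler = make(
            err -> System.err.println("Reporting: " + err),
            () -> System.out.println("suicide-fn called; a real worker would exit here"));
        handler.accept(new java.io.InterruptedIOException("raised by user code"));
    }
}
```

The `finally` block is the point of the sketch: the exit path does not depend on which exception type was seen or on whether reporting worked.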

@revans2 (Contributor) commented Nov 21, 2016

-1.

We often use InterruptedException to indicate that the process is shutting down, and we don't want to blow up when we see one. Perhaps what we want to do is document this better.

@sathyafmt commented Dec 1, 2016

@revans2 - His change in storm-1.0.2 is in the report-error-and-die function, so it should shut down, correct?

@chawco (Contributor, Author) commented Dec 1, 2016

So, there's some ambiguity here that would need to be resolved. Basically, there are two sources of InterruptedException/InterruptedIOException: one from Storm itself, and one from the user's code (see STORM-2194). If we want different behaviour for these two sources, we need some way to disambiguate them.

Basically, the problem is this: If anything in the user's thread raises these exceptions, the executor thread will terminate but the worker will not. This leaves the entire topology in a zombified state. I find it difficult to see that this behaviour is "as intended".

@revans2 can you suggest a way we can disambiguate between Storm-initiated and user-initiated exceptions here? I'm having a real tough time thinking how we could accomplish that. Alternatively, can you propose a different implementation for signalling this shutdown? I'm happy to take on that work to make things shut down in a more appropriate manner, as long as we get to fix this zombie topology behaviour.

@revans2 (Contributor) commented Dec 2, 2016

@chawco

Okay so I understand the issue better now. SocketTimeoutException is a subclass of InterruptedIOException.

https://docs.oracle.com/javase/7/docs/api/java/net/SocketTimeoutException.html

I could argue that this is a mistake on Java's part, but it is set in stone, so we have to deal with it.
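
The subclass relationship is easy to confirm with a few lines of Java. This is only a small check against the JDK classes named above; the class and method names (`TimeoutHierarchy`, `looksInterrupted`) are made up for the example.

```java
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;

final class TimeoutHierarchy {
    /** True when the throwable would be caught by a handler for InterruptedIOException. */
    static boolean looksInterrupted(Throwable t) {
        return t instanceof InterruptedIOException;
    }

    public static void main(String[] args) {
        Throwable timeout = new SocketTimeoutException("Read timed out");
        // An ordinary socket timeout is indistinguishable from an interrupt by type alone.
        System.out.println(looksInterrupted(timeout)); // true
        System.out.println(
            InterruptedIOException.class.isAssignableFrom(SocketTimeoutException.class)); // true
    }
}
```

So any check that matches InterruptedIOException also matches a plain socket timeout raised in user code.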

I see two options.

  1. We can treat a SocketTimeoutException differently from other InterruptedIOExceptions,
  2. or we can just treat all InterruptedIOExceptions as fatal.

We started ignoring InterruptedIOExceptions because we would occasionally run into them in the supervisor or nimbus local cluster tests and that would fail everything. Having proper behavior is more important than having super stable unit tests, but if we can have both (option 1) I think that would be best.

@revans2 (Contributor) commented Dec 2, 2016

We should be able to fix this with code like:

(if (or
      (exception-cause? InterruptedException error)
      (and
        (exception-cause? java.io.InterruptedIOException error)
        (not (exception-cause? java.net.SocketTimeoutException error))))
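
For readers more at home in Java, the same condition can be sketched as below. This assumes that exception-cause? tests every throwable in the cause chain; `ShutdownSignal`, `hasCause` and `isShutdownSignal` are hypothetical names for this illustration, not Storm code.

```java
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;

final class ShutdownSignal {
    /** True if the error, or anything in its cause chain, is an instance of type. */
    static boolean hasCause(Class<? extends Throwable> type, Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (type.isInstance(t)) {
                return true;
            }
        }
        return false;
    }

    /** Option 1: interrupts count as a shutdown signal, socket timeouts do not. */
    static boolean isShutdownSignal(Throwable error) {
        return hasCause(InterruptedException.class, error)
            || (hasCause(InterruptedIOException.class, error)
                && !hasCause(SocketTimeoutException.class, error));
    }

    public static void main(String[] args) {
        System.out.println(isShutdownSignal(new InterruptedException("shutting down"))); // true
        System.out.println(isShutdownSignal(
            new RuntimeException(new SocketTimeoutException("Read timed out")))); // false
    }
}
```

Under this sketch a wrapped socket timeout falls through to the fatal path, while a genuine interrupt is still treated as a shutdown in progress.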

@sathyafmt commented Dec 2, 2016

Thanks @revans2.

But this won't fix the other problem I am seeing with java.net.BindException (address already in use). See my comment in STORM-2194:
1. When Storm workers start and are not able to bind to 56700 (the RMI port), they hang around and do not die. This is easy to reproduce: I started nc -l 56700 and then started the topology. With your patch, the worker dies and the supervisor restarts it.
2016-12-01 04:24:41.721 STDERR [INFO] Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 56700; nested exception is:
2016-12-01 04:24:41.722 STDERR [INFO] java.net.BindException: Address already in use

@revans2 (Contributor) commented Jan 6, 2017

@sathyafmt sorry I have taken so long to respond; December was a really crazy month for me. From STORM-2194 I see that the SocketTimeoutException goes through the code being changed. The RMI code does not go through that path at all.

2016-12-01 04:24:41.721 STDERR [INFO] Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 56700; nested exception is:
2016-12-01 04:24:41.722 STDERR [INFO] java.net.BindException: Address already in use

If it did then we would have exited because BindException and ExportException are neither InterruptedIOException nor InterruptedException.

So neither this patch nor the one I proposed would have any impact on the RMI case at all. Something else is catching the ExportException and printing the error message above to STDERR.
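
The type argument can be checked directly. The sketch below rebuilds the exception from the log lines above and tests it against the two interrupt types, again assuming the check walks the cause chain; `RmiCaseCheck`, `hasCause` and `wouldBeIgnored` are hypothetical names used only for this example.

```java
import java.io.InterruptedIOException;
import java.net.BindException;
import java.rmi.server.ExportException;

final class RmiCaseCheck {
    /** True if the error, or anything in its cause chain, is an instance of type. */
    static boolean hasCause(Class<? extends Throwable> type, Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (type.isInstance(t)) {
                return true;
            }
        }
        return false;
    }

    /** True when the error would be treated as an interrupt rather than as fatal. */
    static boolean wouldBeIgnored(Throwable error) {
        return hasCause(InterruptedException.class, error)
            || hasCause(InterruptedIOException.class, error);
    }

    public static void main(String[] args) {
        // Same shape as the logged error: an ExportException wrapping a BindException.
        Throwable rmi = new ExportException(
            "Port already in use: 56700", new BindException("Address already in use"));
        System.out.println(wouldBeIgnored(rmi)); // false
    }
}
```

Since neither class in that chain is an interrupt type, the hang in the RMI case has to come from a different code path.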

d2r pushed a commit to d2r/storm that referenced this pull request Oct 16, 2018
Derek Dagit
We are closing stale Pull Requests to make the list more manageable.

Please re-open any Pull Request that has been closed in error.

Closes apache#608
Closes apache#639
Closes apache#640
Closes apache#648
Closes apache#662
Closes apache#668
Closes apache#692
Closes apache#705
Closes apache#724
Closes apache#728
Closes apache#730
Closes apache#753
Closes apache#803
Closes apache#854
Closes apache#922
Closes apache#986
Closes apache#992
Closes apache#1019
Closes apache#1040
Closes apache#1041
Closes apache#1043
Closes apache#1046
Closes apache#1051
Closes apache#1078
Closes apache#1146
Closes apache#1164
Closes apache#1165
Closes apache#1178
Closes apache#1213
Closes apache#1225
Closes apache#1258
Closes apache#1259
Closes apache#1268
Closes apache#1272
Closes apache#1277
Closes apache#1278
Closes apache#1288
Closes apache#1296
Closes apache#1328
Closes apache#1342
Closes apache#1353
Closes apache#1370
Closes apache#1376
Closes apache#1391
Closes apache#1395
Closes apache#1399
Closes apache#1406
Closes apache#1410
Closes apache#1422
Closes apache#1427
Closes apache#1443
Closes apache#1462
Closes apache#1468
Closes apache#1483
Closes apache#1506
Closes apache#1509
Closes apache#1515
Closes apache#1520
Closes apache#1521
Closes apache#1525
Closes apache#1527
Closes apache#1544
Closes apache#1550
Closes apache#1566
Closes apache#1569
Closes apache#1570
Closes apache#1575
Closes apache#1580
Closes apache#1584
Closes apache#1591
Closes apache#1600
Closes apache#1611
Closes apache#1613
Closes apache#1639
Closes apache#1703
Closes apache#1711
Closes apache#1719
Closes apache#1737
Closes apache#1760
Closes apache#1767
Closes apache#1768
Closes apache#1785
Closes apache#1799
Closes apache#1822
Closes apache#1824
Closes apache#1844
Closes apache#1874
Closes apache#1918
Closes apache#1928
Closes apache#1937
Closes apache#1942
Closes apache#1951
Closes apache#1957
Closes apache#1963
Closes apache#1964
Closes apache#1965
Closes apache#1967
Closes apache#1968
Closes apache#1971
Closes apache#1985
Closes apache#1986
Closes apache#1998
Closes apache#2031
Closes apache#2032
Closes apache#2071
Closes apache#2076
Closes apache#2108
Closes apache#2119
Closes apache#2128
Closes apache#2142
Closes apache#2174
Closes apache#2206
Closes apache#2297
Closes apache#2322
Closes apache#2332
Closes apache#2341
Closes apache#2377
Closes apache#2414
Closes apache#2469
@asfgit asfgit closed this in #2880 Oct 22, 2018