-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
STORM-946: We should remove Closed Client form cached-node+port->socket in worker #639
STORM-946: We should remove Closed Client form cached-node+port->socket in worker #639
Conversation
+1 |
|
||
|
||
closed-connections (into [] (for [[node+port socket] @(:cached-node+port->socket worker)] | ||
(if (.isClosed socket) node+port))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1
We need take attention to unneeded connections but not closed connections ,
sometimes closed connections need to be reconnected by retry policy to make sure reconnect losted connections.
refresh-connections will try remove unneeded connections
by the way:
java.lang.IllegalArgumentException: No matching field found: isClosed for class backtype.storm.messaging.netty.Client
May be you should use client.status() somewhere when you need .
Yes we should use client.status(), and I am sorry for this error, I will add new patch to fix this. @caofangkun Will you have a look at Client, once client's status change to Closed, it will never have change to change to other state, that is to say reconnect won't be called after client closed. The problem happened in our product cluster when some net error happen, but after net recover, the topology can not recover. After I remove closed client first, this problem never come again. |
@tedxia Does have any situation will call |
Will you have a look at the description of this pull request, it describe the situation. |
Sorry for commenting on such an old pull request, but we recently also encountered the problem this patch is trying to address.
Indeed there is one other situation where the close() method is called: This only happens if the number of reconnects reaches the number set in the 'storm.messaging.netty.max_retries' config setting (I believe the default is 300). But the problem is aggravated by the fact, that the There is another more serious issue that lead to huge problems in one of out topologies whenever a worker crashed due to some exception. In our case this frequenty happened because nimbus often rescheduled the whole toplogy right around the same time when the supervisor restarted the failed worker, leading to a race condition. This might have to do with the fact that Tedxia's patch won't fix this problem I belive, since the status returned by |
Please file a JIRA at https://issues.apache.org/jira under the STORM project describing what you described here. We ran into a similar issue, but I would have to check on the progress of that fix/etc, and if it could be back ported to 0.9.x |
@revans2 was there ever an issue filed about this? We are seeing the same behavior in Storm 0.10.0 |
Actually it looks like this issue
may be fixed by STORM-1609 with the addition of the client keepalive TimerTask |
I can confirm that it is |
We are closing stale Pull Requests to make the list more manageable. Please re-open any Pull Request that has been closed in error. Closes apache#608 Closes apache#639 Closes apache#640 Closes apache#648 Closes apache#662 Closes apache#668 Closes apache#692 Closes apache#705 Closes apache#724 Closes apache#728 Closes apache#730 Closes apache#753 Closes apache#803 Closes apache#854 Closes apache#922 Closes apache#986 Closes apache#992 Closes apache#1019 Closes apache#1040 Closes apache#1041 Closes apache#1043 Closes apache#1046 Closes apache#1051 Closes apache#1078 Closes apache#1146 Closes apache#1164 Closes apache#1165 Closes apache#1178 Closes apache#1213 Closes apache#1225 Closes apache#1258 Closes apache#1259 Closes apache#1268 Closes apache#1272 Closes apache#1277 Closes apache#1278 Closes apache#1288 Closes apache#1296 Closes apache#1328 Closes apache#1342 Closes apache#1353 Closes apache#1370 Closes apache#1376 Closes apache#1391 Closes apache#1395 Closes apache#1399 Closes apache#1406 Closes apache#1410 Closes apache#1422 Closes apache#1427 Closes apache#1443 Closes apache#1462 Closes apache#1468 Closes apache#1483 Closes apache#1506 Closes apache#1509 Closes apache#1515 Closes apache#1520 Closes apache#1521 Closes apache#1525 Closes apache#1527 Closes apache#1544 Closes apache#1550 Closes apache#1566 Closes apache#1569 Closes apache#1570 Closes apache#1575 Closes apache#1580 Closes apache#1584 Closes apache#1591 Closes apache#1600 Closes apache#1611 Closes apache#1613 Closes apache#1639 Closes apache#1703 Closes apache#1711 Closes apache#1719 Closes apache#1737 Closes apache#1760 Closes apache#1767 Closes apache#1768 Closes apache#1785 Closes apache#1799 Closes apache#1822 Closes apache#1824 Closes apache#1844 Closes apache#1874 Closes apache#1918 Closes apache#1928 Closes apache#1937 Closes apache#1942 Closes apache#1951 Closes apache#1957 Closes apache#1963 Closes apache#1964 Closes apache#1965 Closes apache#1967 Closes apache#1968 Closes apache#1971 Closes apache#1985 Closes apache#1986 Closes apache#1998 Closes apache#2031 Closes apache#2032 Closes apache#2071 Closes apache#2076 Closes apache#2108 Closes apache#2119 Closes apache#2128 Closes apache#2142 Closes apache#2174 Closes apache#2206 Closes apache#2297 Closes apache#2322 Closes apache#2332 Closes apache#2341 Closes apache#2377 Closes apache#2414 Closes apache#2469
We are closing stale Pull Requests to make the list more manageable. Please re-open any Pull Request that has been closed in error. Closes apache#608 Closes apache#639 Closes apache#640 Closes apache#648 Closes apache#662 Closes apache#668 Closes apache#692 Closes apache#705 Closes apache#724 Closes apache#728 Closes apache#730 Closes apache#753 Closes apache#803 Closes apache#854 Closes apache#922 Closes apache#986 Closes apache#992 Closes apache#1019 Closes apache#1040 Closes apache#1041 Closes apache#1043 Closes apache#1046 Closes apache#1051 Closes apache#1078 Closes apache#1146 Closes apache#1164 Closes apache#1165 Closes apache#1178 Closes apache#1213 Closes apache#1225 Closes apache#1258 Closes apache#1259 Closes apache#1268 Closes apache#1272 Closes apache#1277 Closes apache#1278 Closes apache#1288 Closes apache#1296 Closes apache#1328 Closes apache#1342 Closes apache#1353 Closes apache#1370 Closes apache#1376 Closes apache#1391 Closes apache#1395 Closes apache#1399 Closes apache#1406 Closes apache#1410 Closes apache#1422 Closes apache#1427 Closes apache#1443 Closes apache#1462 Closes apache#1468 Closes apache#1483 Closes apache#1506 Closes apache#1509 Closes apache#1515 Closes apache#1520 Closes apache#1521 Closes apache#1525 Closes apache#1527 Closes apache#1544 Closes apache#1550 Closes apache#1566 Closes apache#1569 Closes apache#1570 Closes apache#1575 Closes apache#1580 Closes apache#1584 Closes apache#1591 Closes apache#1600 Closes apache#1611 Closes apache#1613 Closes apache#1639 Closes apache#1703 Closes apache#1711 Closes apache#1719 Closes apache#1737 Closes apache#1760 Closes apache#1767 Closes apache#1768 Closes apache#1785 Closes apache#1799 Closes apache#1822 Closes apache#1824 Closes apache#1844 Closes apache#1874 Closes apache#1918 Closes apache#1928 Closes apache#1937 Closes apache#1942 Closes apache#1951 Closes apache#1957 Closes apache#1963 Closes apache#1964 Closes apache#1965 Closes apache#1967 Closes apache#1968 Closes apache#1971 Closes apache#1985 Closes apache#1986 Closes apache#1998 Closes apache#2031 Closes apache#2032 Closes apache#2071 Closes apache#2076 Closes apache#2108 Closes apache#2119 Closes apache#2128 Closes apache#2142 Closes apache#2174 Closes apache#2206 Closes apache#2297 Closes apache#2322 Closes apache#2332 Closes apache#2341 Closes apache#2377 Closes apache#2414 Closes apache#2469
Patch for STORM-946
The client may be Closed status after reconnect failed, and we will remove closed client from Context to escape memory leak.
But there is also reference for the closed Client in cached-node+port->socket in worker, for this reason we should also remove closed Client from cached-node+port->socket.
Meanwhile there is another reason for us to do so. Think about this situation: worker A connect to worker B1 B2, but for some reason worker B1 B2 died at the same, then nimbus reschedule worker B1 B1. And new B1 B2 may partly rescheduled at the some host:port as old B1 B2, that is (old B1: host1+port1, old B2: host2+port2, new B1: host2+port2, new B2: host3+port3). Worker A realized worker B1 B2 died and start reconnect to worker B1 B2, but before new worker B1 and old B2 have the same host+port, and by the current logic, we will remove old B1 Client and and create new Client for new worker B2, and do nothing to old B2 and new B1 because they have the same host+port. This will result the topology stop processing tuples. Once we remove closed Client from cached-node+port->socket before refresh-connections, this will not happen again.