Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STORM-946: We should remove Closed Client form cached-node+port->socket in worker #639

Conversation

tedxia
Copy link

@tedxia tedxia commented Jul 17, 2015

Patch for STORM-946

The client may be Closed status after reconnect failed, and we will remove closed client from Context to escape memory leak.

But there is also reference for the closed Client in cached-node+port->socket in worker, for this reason we should also remove closed Client from cached-node+port->socket.
Meanwhile there is another reason for us to do so. Think about this situation: worker A connect to worker B1 B2, but for some reason worker B1 B2 died at the same, then nimbus reschedule worker B1 B1. And new B1 B2 may partly rescheduled at the some host:port as old B1 B2, that is (old B1: host1+port1, old B2: host2+port2, new B1: host2+port2, new B2: host3+port3). Worker A realized worker B1 B2 died and start reconnect to worker B1 B2, but before new worker B1 and old B2 have the same host+port, and by the current logic, we will remove old B1 Client and and create new Client for new worker B2, and do nothing to old B2 and new B1 because they have the same host+port. This will result the topology stop processing tuples. Once we remove closed Client from cached-node+port->socket before refresh-connections, this will not happen again.

@vesense
Copy link
Member

vesense commented Jul 17, 2015

+1



closed-connections (into [] (for [[node+port socket] @(:cached-node+port->socket worker)]
(if (.isClosed socket) node+port)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1

We need take attention to unneeded connections but not closed connections ,
sometimes closed connections need to be reconnected by retry policy to make sure reconnect losted connections.
refresh-connections will try remove unneeded connections

by the way:

java.lang.IllegalArgumentException: No matching field found: isClosed for class  backtype.storm.messaging.netty.Client 

May be you should use client.status() somewhere when you need .

@tedxia
Copy link
Author

tedxia commented Jul 17, 2015

Yes we should use client.status(), and I am sorry for this error, I will add new patch to fix this.

@caofangkun Will you have a look at Client, once client's status change to Closed, it will never have change to change to other state, that is to say reconnect won't be called after client closed.

The problem happened in our product cluster when some net error happen, but after net recover, the topology can not recover. After I remove closed client first, this problem never come again.

@caofangkun
Copy link
Contributor

@tedxia
I could only find two situations ( am I right?) will change client.status to closed by call (.close socket). and do not need reconnect at all
1: when reresh-connections remove unneeded connections and wil remove connecion from cached-node+port->socket
2: when shutdown worker

Does have any situation will call (.close socket) but not removed from cached-node+port->socket ?

@tedxia
Copy link
Author

tedxia commented Jul 17, 2015

Will you have a look at the description of this pull request, it describe the situation.

@nicom
Copy link

nicom commented Mar 31, 2016

@caofangkun

Sorry for commenting on such an old pull request, but we recently also encountered the problem this patch is trying to address.

Does have any situation will call (.close socket) but not removed from cached-node+port->socket ?

Indeed there is one other situation where the close() method is called:
https://github.com/apache/storm/blob/v0.9.6/storm-core/src/jvm/backtype/storm/messaging/netty/Client.java#L505

This only happens if the number of reconnects reaches the number set in the 'storm.messaging.netty.max_retries' config setting (I believe the default is 300). But the problem is aggravated by the fact, that the connectionAttempts counter is never reset. So if a topology runs for a long enough time this might become a problem.
Actually this issue has been fixed in 0.10.0

There is another more serious issue that lead to huge problems in one of out topologies whenever a worker crashed due to some exception.
If worker A sucessfully connects to worker B for the first time during startup but worker B closes the connection for some reason before the :worker-active-flag is set to true (here https://github.com/apache/storm/blob/v0.9.6/storm-core/src/clj/backtype/storm/daemon/worker.clj#L356), there will be no further reconnect attempts, since no messages will be processed and neither send() nor flushMessages() will ever be called.

In our case this frequenty happened because nimbus often rescheduled the whole toplogy right around the same time when the supervisor restarted the failed worker, leading to a race condition. This might have to do with the fact that nimbus.task.timeout.secs and supervisor.worker.timeout.secs have the same value by default.

Tedxia's patch won't fix this problem I belive, since the status returned by status() will be ConnectionWithStatus$Status/Connecting in this situation.

@revans2
Copy link
Contributor

revans2 commented Mar 31, 2016

@nicom

Please file a JIRA at https://issues.apache.org/jira under the STORM project describing what you described here. We ran into a similar issue, but I would have to check on the progress of that fix/etc, and if it could be back ported to 0.9.x

@kevinconaway
Copy link
Contributor

@revans2 was there ever an issue filed about this? We are seeing the same behavior in Storm 0.10.0

@kevinconaway
Copy link
Contributor

Actually it looks like this issue

There is another more serious issue that lead to huge problems in one of out topologies whenever a worker crashed due to some exception.
If worker A sucessfully connects to worker B for the first time during startup but worker B closes the connection for some reason before the :worker-active-flag is set to true (here https://github.com/apache/storm/blob/v0.9.6/storm-core/src/clj/backtype/storm/daemon/worker.clj#L356), there will be no further reconnect attempts, since no messages will be processed and neither send() nor flushMessages() will ever be called.

may be fixed by STORM-1609 with the addition of the client keepalive TimerTask

@nicom
Copy link

nicom commented Jun 16, 2016

may be fixed by STORM-1609 with the addition of the client keepalive TimerTask

I can confirm that it is

d2r pushed a commit to d2r/storm that referenced this pull request Oct 16, 2018
We are closing stale Pull Requests to make the list more manageable.

Please re-open any Pull Request that has been closed in error.

Closes apache#608
Closes apache#639
Closes apache#640
Closes apache#648
Closes apache#662
Closes apache#668
Closes apache#692
Closes apache#705
Closes apache#724
Closes apache#728
Closes apache#730
Closes apache#753
Closes apache#803
Closes apache#854
Closes apache#922
Closes apache#986
Closes apache#992
Closes apache#1019
Closes apache#1040
Closes apache#1041
Closes apache#1043
Closes apache#1046
Closes apache#1051
Closes apache#1078
Closes apache#1146
Closes apache#1164
Closes apache#1165
Closes apache#1178
Closes apache#1213
Closes apache#1225
Closes apache#1258
Closes apache#1259
Closes apache#1268
Closes apache#1272
Closes apache#1277
Closes apache#1278
Closes apache#1288
Closes apache#1296
Closes apache#1328
Closes apache#1342
Closes apache#1353
Closes apache#1370
Closes apache#1376
Closes apache#1391
Closes apache#1395
Closes apache#1399
Closes apache#1406
Closes apache#1410
Closes apache#1422
Closes apache#1427
Closes apache#1443
Closes apache#1462
Closes apache#1468
Closes apache#1483
Closes apache#1506
Closes apache#1509
Closes apache#1515
Closes apache#1520
Closes apache#1521
Closes apache#1525
Closes apache#1527
Closes apache#1544
Closes apache#1550
Closes apache#1566
Closes apache#1569
Closes apache#1570
Closes apache#1575
Closes apache#1580
Closes apache#1584
Closes apache#1591
Closes apache#1600
Closes apache#1611
Closes apache#1613
Closes apache#1639
Closes apache#1703
Closes apache#1711
Closes apache#1719
Closes apache#1737
Closes apache#1760
Closes apache#1767
Closes apache#1768
Closes apache#1785
Closes apache#1799
Closes apache#1822
Closes apache#1824
Closes apache#1844
Closes apache#1874
Closes apache#1918
Closes apache#1928
Closes apache#1937
Closes apache#1942
Closes apache#1951
Closes apache#1957
Closes apache#1963
Closes apache#1964
Closes apache#1965
Closes apache#1967
Closes apache#1968
Closes apache#1971
Closes apache#1985
Closes apache#1986
Closes apache#1998
Closes apache#2031
Closes apache#2032
Closes apache#2071
Closes apache#2076
Closes apache#2108
Closes apache#2119
Closes apache#2128
Closes apache#2142
Closes apache#2174
Closes apache#2206
Closes apache#2297
Closes apache#2322
Closes apache#2332
Closes apache#2341
Closes apache#2377
Closes apache#2414
Closes apache#2469
d2r pushed a commit to d2r/storm that referenced this pull request Oct 16, 2018
We are closing stale Pull Requests to make the list more manageable.

Please re-open any Pull Request that has been closed in error.

Closes apache#608
Closes apache#639
Closes apache#640
Closes apache#648
Closes apache#662
Closes apache#668
Closes apache#692
Closes apache#705
Closes apache#724
Closes apache#728
Closes apache#730
Closes apache#753
Closes apache#803
Closes apache#854
Closes apache#922
Closes apache#986
Closes apache#992
Closes apache#1019
Closes apache#1040
Closes apache#1041
Closes apache#1043
Closes apache#1046
Closes apache#1051
Closes apache#1078
Closes apache#1146
Closes apache#1164
Closes apache#1165
Closes apache#1178
Closes apache#1213
Closes apache#1225
Closes apache#1258
Closes apache#1259
Closes apache#1268
Closes apache#1272
Closes apache#1277
Closes apache#1278
Closes apache#1288
Closes apache#1296
Closes apache#1328
Closes apache#1342
Closes apache#1353
Closes apache#1370
Closes apache#1376
Closes apache#1391
Closes apache#1395
Closes apache#1399
Closes apache#1406
Closes apache#1410
Closes apache#1422
Closes apache#1427
Closes apache#1443
Closes apache#1462
Closes apache#1468
Closes apache#1483
Closes apache#1506
Closes apache#1509
Closes apache#1515
Closes apache#1520
Closes apache#1521
Closes apache#1525
Closes apache#1527
Closes apache#1544
Closes apache#1550
Closes apache#1566
Closes apache#1569
Closes apache#1570
Closes apache#1575
Closes apache#1580
Closes apache#1584
Closes apache#1591
Closes apache#1600
Closes apache#1611
Closes apache#1613
Closes apache#1639
Closes apache#1703
Closes apache#1711
Closes apache#1719
Closes apache#1737
Closes apache#1760
Closes apache#1767
Closes apache#1768
Closes apache#1785
Closes apache#1799
Closes apache#1822
Closes apache#1824
Closes apache#1844
Closes apache#1874
Closes apache#1918
Closes apache#1928
Closes apache#1937
Closes apache#1942
Closes apache#1951
Closes apache#1957
Closes apache#1963
Closes apache#1964
Closes apache#1965
Closes apache#1967
Closes apache#1968
Closes apache#1971
Closes apache#1985
Closes apache#1986
Closes apache#1998
Closes apache#2031
Closes apache#2032
Closes apache#2071
Closes apache#2076
Closes apache#2108
Closes apache#2119
Closes apache#2128
Closes apache#2142
Closes apache#2174
Closes apache#2206
Closes apache#2297
Closes apache#2322
Closes apache#2332
Closes apache#2341
Closes apache#2377
Closes apache#2414
Closes apache#2469
@asfgit asfgit closed this in #2880 Oct 22, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants