zmq: enable tcp keepalive #14687
sudo sysctl -w net.ipv4.tcp_keepalive_time=600

Setting the keepalive values appropriately for your operating environment may
improve connectiviy in situations where long-lived connections are silently
practicalswift
Nov 9, 2018
Contributor
"connectivity" :-)
mruddy
Nov 9, 2018
Author
Contributor
fixed, thanks :-)
Like changing
@@ -86,6 +86,15 @@ bool CZMQAbstractPublishNotifier::Initialize(void *pcontext)
        return false;
    }

    const int so_keepalive_option {1};
    rc = zmq_setsockopt(psocket, ZMQ_TCP_KEEPALIVE, &so_keepalive_option, sizeof(so_keepalive_option));
    if (rc != 0)
promag
Nov 15, 2018
Member
nit, `{` here.
mruddy
Nov 15, 2018
Author
Contributor
updated the formatting here to be on the same line. i had to force myself to do it because that's what the dev notes say to do. but it was hard because so much of the zmq code is of the other style :)
For example, when running on GNU/Linux, one might use the following
to lower the keepalive setting to 10 minutes:

sudo sysctl -w net.ipv4.tcp_keepalive_time=600
promag
Nov 15, 2018
Member
Does it make sense to support setting per socket? I mean, allow custom ZMQ_TCP_KEEPALIVE_IDLE?
mruddy
Nov 15, 2018
Author
Contributor
The best reason that I could think of to support my stance of not wanting to add the extra options, besides adding complexity and configuration bloat, is that these timeouts are environmental and so it seems like once you figure them out, it's best to just set an appropriate value at the system level. This does assume that you're not making these ZMQ connections to a bunch of other external-party system environments where per socket config could be useful. I think this assumption is valid though because I perceive the ZMQ stuff to be for data distribution within one's own (internal) environment. Third parties would be running their own nodes and distributing data from those nodes within their own envs. Make sense?
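For readers weighing the per-socket alternative discussed above, here is a minimal sketch of what per-socket tuning looks like at the operating system level. It uses Python's standard `socket` module rather than libzmq, so it is only an analogy for `ZMQ_TCP_KEEPALIVE_IDLE` and its sibling options. The helper name `set_per_socket_keepalive` and the values 600/60/5 are illustrative assumptions, not part of this PR. Option names that the platform does not expose are skipped.

```python
import socket

def set_per_socket_keepalive(sock, idle_s, interval_s, probes):
    """Enable keepalive on one socket and override the system-wide timers for it.

    Returns the values the OS reports back for each option that was applied.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {}
    wanted = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    )
    for name, value in wanted:
        opt = getattr(socket, name, None)
        if opt is None:
            continue  # this platform's socket module does not expose the option
        sock.setsockopt(socket.IPPROTO_TCP, opt, value)
        applied[name] = sock.getsockopt(socket.IPPROTO_TCP, opt)
    return applied

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    applied = set_per_socket_keepalive(sock, idle_s=600, interval_s=60, probes=5)
    print(applied)
```

Each such knob would need its own bitcoind configuration option, which is the configuration bloat the reply above argues against.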
I don't believe it is for Linux or Windows.
@mruddy thanks, I also don't find evidence that it's possible.
    rc = zmq_setsockopt(psocket, ZMQ_TCP_KEEPALIVE, &so_keepalive_option, sizeof(so_keepalive_option));
    if (rc != 0) {
        zmqError("Failed to set SO_KEEPALIVE");
        zmq_close(psocket);
luke-jr
Dec 20, 2018
Member
Maybe if it fails to set SO_KEEPALIVE, it should still continue with a warning?
laanwj
Jan 8, 2019
Member
When does this fail?
If SO_KEEPALIVE is important, I don't think ignoring errors is a good idea. Not everyone continuously monitors the logs, and it might suddenly result in a return of issue #14687.
jgarzik
Jan 8, 2019
Contributor
FWIW,
- The kernel only fails SO_KEEPALIVE setsockopt(2) if there is something assert()-level wrong with the socket (invalid file descriptor; fd is closed; fd is not a socket; basic stuff...)
- zmq_setsockopt() is not a direct wrapper, but close: the value is stored, and setsockopt(2) is called later in zmq's src/tcp.cpp.
- zmq_setsockopt() will only return an error if there's something assert()-level wrong with the zmq structure.
Ergo, errors are serious but will not occur due to Linux network stack conditions. Errors only occur due to programmatic error (zmq passed file fd to socket syscall) or memory corruption.
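The first point above can be observed with a plain socket. The following is a minimal sketch using Python's standard `socket` module, not libzmq; the helper name `try_enable_keepalive` is hypothetical. It shows the call succeeding on a healthy socket and being rejected once the descriptor is no longer valid (on Linux this is typically reported as EBADF).

```python
import errno
import socket

def try_enable_keepalive(sock):
    """Return None on success, or the errno name if the OS rejects the call."""
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None

# A freshly created TCP socket accepts the option.
healthy = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
healthy_result = try_enable_keepalive(healthy)
healthy.close()

# After close() the descriptor is invalid, so the same call fails.
closed = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
closed.close()
closed_result = try_enable_keepalive(closed)

print("healthy socket:", healthy_result)
print("closed socket:", closed_result)
```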
luke-jr
Jan 8, 2019
Member
I would imagine it fails on (perhaps just hypothetical) systems without SO_KEEPALIVE
jgarzik
Jan 8, 2019
Contributor
zmq will return 0 (success) on systems without SO_KEEPALIVE
Tested ACK.
I am running lightningnetwork/lnd in connection with bitcoind on Ubuntu 16.04 and bitcoind dropped the zmq connection always after a few hours (according to output of
To resolve this problem I included this patch into
Crosscheck: I switched to the unpatched 0.18.0rc1 yesterday and after a few minutes bitcoind dropped the zmq connection again. I returned to the patched 0.17.1 and the connection is since stable again.
@dlogemann thanks for testing! I am still happy with this patch as well. I think it's ready to merge.
Hi, when will this PR be merged?
Needs more review.
The last travis run for this pull request was 273 days ago and is thus outdated. To trigger a fresh travis build, this pull request should be closed and re-opened.
So this was never added? Cause there are still issues with zmq and lnd.
so help review and test this PR, and post an ACK
I've cherry picked this up into the latest bitcoind version, been running it for the past 15 hours. So far it's working great. My lnd node has stayed synced to the network, channels have been opened and persisted fine. The actual zmq code base changes are minimal so seems ok to be pushed into master. However I've only tested this on Ubuntu.
Tested ACK: Works as expected (sends TCP Keepalive packets every
Just adding some notes from my testing with Windows 10 Home Version 1909 OS build 18363.418 that was downloaded from https://www.microsoft.com/en-us/software-download/windows10ISO
I cross-compiled using the directions at https://github.com/bitcoin/bitcoin/blob/master/doc/build-windows.md.
In summary, these changes do not affect the Windows 10 regtest Bitcoin node (i.e., where the Windows Bitcoin node is the ZMQ publisher). I think the reason why is because the libzmq code does not call
These changes are still good for Linux systems. These notes just document how Windows is not affected by this PR because of a bug in the ZMQ library.
Additionally, I verified that setting ZMQ_TCP_KEEPALIVE on the subscriber side does work to keep the connections alive and working. Doing that is an alternative to this PR (which sets ZMQ_TCP_KEEPALIVE on the publisher side). The goal is to keep packets going through whatever intermediary has the idle timeout that silently drops the connection. Therefore, the keepalive can be set on either side of the TCP connection to be effective.
Helpful related links:
Testing note: the following commands were useful to set up the subscriber-side Linux system so that its firewall would silently drop the connection after 30 seconds of being idle (when ZMQ_TCP_KEEPALIVE was not set [or having the desired effect]).
Just to summarize for those looking to review - as of c276df7 there are 3 tACKs (n-thumann, Haaroon, and dlogemann), 1 "looks good to me" (laanwj) with no NACKs or any show-stopping concerns raised.
I also cherry picked the changes into the current master branch and tested on macOS 10.15.5. I started bitcoind on regtest and used
Also, @adamjonas, I think you mean there were 3 tACKs, this seems like it has been pretty well tested.
@adaminsky updated my previous comment to correct my error. Thanks.
I think this is ready for merge, unless I'm missing some controversy.
utACK c276df7
c276df7 zmq: enable tcp keepalive (mruddy)
ACKs for top commit:
adamjonas: Just to summarize for those looking to review - as of c276df7 there are 3 tACKs (n-thumann, Haaroon, and dlogemann), 1 "looks good to me" (laanwj) with no NACKs or any show-stopping concerns raised.
jonasschnelli: utACK c276df7
Tree-SHA512: b884c2c9814e97e666546a7188c48f9de9541499a11a934bd48dd16169a900c900fa519feb3b1cb7e9915fc7539aac2829c7806b5937b4e1409b4805f3ef6cd1
Pull request description:
This addresses #12754.
These changes enable node operators to address the silent dropping (by network middleboxes) of long-lived, low-activity ZMQ TCP connections via further operating-system-level TCP keepalive configuration. For example, ZMQ sockets that publish block hashes can be affected in this way because of the time that sometimes passes between blocks (e.g., sometimes more than an hour).
Prior to this patch, operating system level TCP keepalive configurations would not take effect since the SO_KEEPALIVE option was not enabled on the underlying socket.
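To illustrate the point about SO_KEEPALIVE, here is a minimal sketch using Python's standard `socket` module. It is not the patch itself, which sets `ZMQ_TCP_KEEPALIVE` through `zmq_setsockopt`; the helper names `enable_keepalive` and `keepalive_enabled` are hypothetical. A newly created TCP socket has the option off by default, so system-level keepalive timers are ignored until the application opts in.

```python
import socket

def keepalive_enabled(sock):
    """Report whether SO_KEEPALIVE is currently set on the socket."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0

def enable_keepalive(sock):
    """Opt this socket in to the operating system's TCP keepalive timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    before = keepalive_enabled(sock)  # default state of a new socket
    enable_keepalive(sock)
    after = keepalive_enabled(sock)
    print("keepalive before:", before, "after:", after)
```

Once the option is on, timers such as `net.ipv4.tcp_keepalive_time` on Linux apply to the connection without further application changes.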
There are additional ZMQ socket options related to TCP keepalive that can be set. However, I decided not to implement those options in this changeset because doing so would require adding additional bitcoin node configuration options, and would not yield a better outcome. I preferred a small, easily reviewable patch that doesn't add a bunch of new config options, with the tradeoff that the fine tuning would have to be done via well-documented operating system specific configurations.
I tested this patch by running a node with:
./src/qt/bitcoin-qt -regtest -txindex -datadir=/tmp/node -zmqpubhashblock=tcp://127.0.0.1:28332 &
and connecting to it with:
python3 ./contrib/zmq/zmq_sub.py
Without these changes,
ss -panto | grep 28332 | grep ESTAB | grep bitcoin
will report no keepalive timer information. With these changes, the output from the prior command will show keepalive timer information consistent with the configuration at the time of connection establishment, e.g.: timer:(keepalive,119min,0).
I also tested with a non-TCP transport and did not witness any adverse effects:
./src/qt/bitcoin-qt -regtest -txindex -datadir=/tmp/node -zmqpubhashblock=ipc:///tmp/bitcoin.block &