zmq: enable tcp keepalive #14687

Open
wants to merge 1 commit into
base: master
from

7 participants
@mruddy
Contributor

mruddy commented Nov 8, 2018

This addresses #12754.

These changes let node operators address the silent dropping, by network middleboxes, of long-lived, low-activity ZMQ TCP connections through operating-system-level TCP keepalive configuration. For example, ZMQ sockets that publish block hashes can be affected in this way because the time between blocks is sometimes long (e.g., more than an hour).

Prior to this patch, operating system level TCP keepalive configurations would not take effect since the SO_KEEPALIVE option was not enabled on the underlying socket.
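
As a rough illustration of what enabling keepalive means at the socket level, the sketch below uses Python's standard socket module on a plain TCP socket rather than libzmq. It is not the patch itself: the helper names are made up for this example, and the assumption is that libzmq applies the equivalent SO_KEEPALIVE setting to its underlying TCP socket when ZMQ_TCP_KEEPALIVE is set to 1.

```python
import socket


def keepalive_enabled(sock: socket.socket) -> bool:
    """Return True if SO_KEEPALIVE is currently set on the socket."""
    # Some platforms report the flag bit instead of 1, so compare against 0.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


def enable_keepalive(sock: socket.socket) -> None:
    """Turn on SO_KEEPALIVE; probe timing still comes from the OS settings."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)


# Hypothetical demonstration on an unconnected TCP socket.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    before = keepalive_enabled(sock)
    enable_keepalive(sock)
    after = keepalive_enabled(sock)
    print(f"SO_KEEPALIVE before={before} after={after}")
```

Until the option is enabled on a socket, system-wide keepalive timers have nothing to act on for that connection, which matches the behavior described above.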

There are additional ZMQ socket options related to TCP keepalive that can be set. However, I decided not to implement those options in this changeset because doing so would require adding additional bitcoin node configuration options, and would not yield a better outcome. I preferred a small, easily reviewable patch that doesn't add a bunch of new config options, with the tradeoff that the fine tuning would have to be done via well-documented operating system specific configurations.

I tested this patch by running a node with:
./src/qt/bitcoin-qt -regtest -txindex -datadir=/tmp/node -zmqpubhashblock=tcp://127.0.0.1:28332 &
and connecting to it with:
python3 ./contrib/zmq/zmq_sub.py

Without these changes, ss -panto | grep 28332 | grep ESTAB | grep bitcoin reports no keepalive timer information. With these changes, the same command shows keepalive timer information consistent with the configuration at the time of connection establishment, e.g., timer:(keepalive,119min,0).
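
The 119min timer above would be consistent with the Linux default tcp_keepalive_time of 7200 seconds (120 minutes), assuming the defaults were in effect during the test. The following sketch, with hypothetical helper names, reads the three Linux keepalive sysctls from /proc and estimates roughly how long an idle, dead connection could go unnoticed: the idle time before the first probe plus one interval per unanswered probe.

```python
from pathlib import Path
from typing import Dict, Optional

KEEPALIVE_KEYS = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")


def read_keepalive_sysctls(base: Path = Path("/proc/sys/net/ipv4")) -> Dict[str, Optional[int]]:
    """Read the Linux keepalive sysctls; a value is None if it cannot be read."""
    values: Dict[str, Optional[int]] = {}
    for key in KEEPALIVE_KEYS:
        try:
            values[key] = int((base / key).read_text().strip())
        except (OSError, ValueError):
            values[key] = None  # not Linux, or the file is missing/unreadable
    return values


def worst_case_detection_seconds(time_s: int, intvl_s: int, probes: int) -> int:
    """Approximate time for an idle connection to be declared dead."""
    return time_s + intvl_s * probes


current = read_keepalive_sysctls()
print("current settings:", current)

# Linux defaults: 7200 s idle, 75 s between probes, 9 probes.
print("defaults, worst case (s):", worst_case_detection_seconds(7200, 75, 9))
```

For keepalive to help against a middlebox, the idle time would need to be shorter than that middlebox's idle timeout, which is environment-specific.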

I also tested with a non-TCP transport and did not witness any adverse effects:
./src/qt/bitcoin-qt -regtest -txindex -datadir=/tmp/node -zmqpubhashblock=ipc:///tmp/bitcoin.block &

@bitcoin bitcoin deleted a comment from DrahtBot Nov 8, 2018

doc/zmq.md Outdated
sudo sysctl -w net.ipv4.tcp_keepalive_time=600

Setting the keepalive values appropriately for your operating environment may
improve connectiviy in situations where long-lived connections are silently

@practicalswift

practicalswift Nov 9, 2018

Member

"connectivity" :-)

@mruddy

mruddy Nov 9, 2018

Contributor

fixed, thanks :-)

@mruddy mruddy force-pushed the mruddy:zmq-keep-alive branch from e7921ce to 66609de Nov 9, 2018

@laanwj

Member

laanwj commented Nov 12, 2018

Looks good to me, but probably needs to be tested by @bitkevin whether it really solves #12754.

@promag
Member

promag left a comment

Like changing tcp_keepalive_time globally, is it possible to enable keepalive globally?

@@ -86,6 +86,15 @@ bool CZMQAbstractPublishNotifier::Initialize(void *pcontext)
            return false;
        }

        const int so_keepalive_option {1};
        rc = zmq_setsockopt(psocket, ZMQ_TCP_KEEPALIVE, &so_keepalive_option, sizeof(so_keepalive_option));
        if (rc != 0)

@promag

promag Nov 15, 2018

Member

nit, { here.

@mruddy

mruddy Nov 15, 2018

Contributor

updated the formatting here to be on the same line. i had to force myself to do it because that's what the dev notes say to do. but it was hard because so much of the zmq code is of the other style :)

For example, when running on GNU/Linux, one might use the following
to lower the keepalive setting to 10 minutes:

sudo sysctl -w net.ipv4.tcp_keepalive_time=600

@promag

promag Nov 15, 2018

Member

Does it make sense to support setting per socket? I mean, allow custom ZMQ_TCP_KEEPALIVE_IDLE?

@mruddy

mruddy Nov 15, 2018

Contributor

The best reason that I could think of to support my stance of not wanting to add the extra options, besides adding complexity and configuration bloat, is that these timeouts are environmental and so it seems like once you figure them out, it's best to just set an appropriate value at the system level. This does assume that you're not making these ZMQ connections to a bunch of other external-party system environments where per socket config could be useful. I think this assumption is valid though because I perceive the ZMQ stuff to be for data distribution within one's own (internal) environment. Third parties would be running their own nodes and distributing data from those nodes within their own envs. Make sense?
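
For comparison, per-socket tuning of the kind asked about (libzmq exposes it as ZMQ_TCP_KEEPALIVE_IDLE, ZMQ_TCP_KEEPALIVE_INTVL, and ZMQ_TCP_KEEPALIVE_CNT) corresponds roughly to the TCP-level socket options below. This is only a sketch on a plain socket using Python's standard library, not something this PR implements; tune_keepalive and its parameters are hypothetical, and the option constants are looked up defensively because not every platform defines all of them.

```python
import socket
from typing import Dict


def tune_keepalive(sock: socket.socket, idle_s: int, interval_s: int, probes: int) -> Dict[str, int]:
    """Enable keepalive and override timing per socket where the OS allows it.

    Returns the values read back for each option that was applied.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied: Dict[str, int] = {}
    requested = (
        ("TCP_KEEPIDLE", idle_s),       # seconds idle before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    )
    for name, value in requested:
        option = getattr(socket, name, None)
        if option is None:
            continue  # constant not available on this platform
        sock.setsockopt(socket.IPPROTO_TCP, option, value)
        applied[name] = sock.getsockopt(socket.IPPROTO_TCP, option)
    return applied


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    print(tune_keepalive(sock, idle_s=600, interval_s=30, probes=5))
```

Each of these knobs would likely need its own node configuration option, which is the complexity the author argues against in favor of system-level settings.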

@mruddy

Contributor

mruddy commented Nov 15, 2018

> is it possible to enable keepalive globally?

I don't believe it is for Linux or Windows.

@promag

Member

promag commented Nov 15, 2018

@mruddy thanks, I also don't find evidence that it's possible.

@mruddy mruddy force-pushed the mruddy:zmq-keep-alive branch from 66609de to c276df7 Nov 15, 2018

        rc = zmq_setsockopt(psocket, ZMQ_TCP_KEEPALIVE, &so_keepalive_option, sizeof(so_keepalive_option));
        if (rc != 0) {
            zmqError("Failed to set SO_KEEPALIVE");
            zmq_close(psocket);

@luke-jr

luke-jr Dec 20, 2018

Member

Maybe if it fails to set SO_KEEPALIVE, it should still continue with a warning?

@laanwj

laanwj Jan 8, 2019

Member

When does this fail?
If SO_KEEPALIVE is important, I don't think ignoring errors is a good idea. Not everyone continuously monitors the logs, and it might suddenly result in a return of issue #12754.

@jgarzik

jgarzik Jan 8, 2019

Contributor

FWIW,

  • The kernel only fails SO_KEEPALIVE setsockopt(2) if there is something assert()-level wrong with the socket (invalid file descriptor; fd is closed; fd is not a socket; basic stuff...)
  • zmq_setsockopt() is not a direct wrapper, but close: the value is stored, and setsockopt(2) is called later in zmq's src/tcp.cpp.
  • zmq_setsockopt() will only return an error if there's something assert()-level wrong with the zmq structure.

Ergo, errors are serious but will not occur due to Linux network stack conditions. Errors only occur due to programmatic error (e.g., zmq passing a regular file's fd to a socket syscall) or memory corruption.
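
The failure mode described here can be reproduced on a plain socket. The sketch below uses Python's standard library, not libzmq (which, per the comment above, stores the value and calls setsockopt(2) later), and try_enable_keepalive is a hypothetical helper: it shows setsockopt succeeding on a valid socket and failing only once the descriptor is no longer valid.

```python
import socket
from typing import Optional


def try_enable_keepalive(sock: socket.socket) -> Optional[int]:
    """Try to set SO_KEEPALIVE; return None on success or the errno on failure."""
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    except OSError as exc:
        return exc.errno
    return None


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
open_result = try_enable_keepalive(sock)    # valid socket: expected to succeed
sock.close()
closed_result = try_enable_keepalive(sock)  # closed descriptor: expected to fail

print("open socket error:", open_result)
print("closed socket error:", closed_result)
```

Because the error path is reached only through this kind of invalid state, treating it as fatal rather than as a warning seems consistent with the reasoning in this thread.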

@luke-jr

luke-jr Jan 8, 2019

Member

I would imagine it fails on (perhaps just hypothetical) systems without SO_KEEPALIVE

@jgarzik

jgarzik Jan 8, 2019

Contributor

zmq will return 0 (success) on systems without SO_KEEPALIVE
