-
Notifications
You must be signed in to change notification settings - Fork 35.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test: improve robustness of connect_nodes() #30118
test: improve robustness of connect_nodes() #30118
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code CoverageFor detailed information about the code coverage, see the test coverage report. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tACK 59ba873
Make successful, so are all the functional tests.
Overall I agree with this change because it makes the nodes connection verification dependent more on connect_nodes()
arguments a, b
instead of the earlier approach that relied more on the state of the whole response of getpeerinfo()
, which seemed brittle as mentioned in the PR description.
I can see connect_nodes()
being used in numerous functional tests, thereby increasing robustness for all of them!
@furszy Were you able to identify few connections that were dropped in logs of this CI run? https://cirrus-ci.com/task/5805115213348864?logs=ci#L3200
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@furszy Were you able to identify few connections that were dropped in logs of this CI run? https://cirrus-ci.com/task/5805115213348864?logs=ci#L3200
Just one. One of the P2PInterface
connections I create on the test side gets disconnected after advancing the node time prior to connecting the test nodes again. Need to expand the complete CI logs to find it.
But the fragility is easy to reproduce. Just launch a thread that disconnects a node before calling connect_nodes()
.
59ba873
to
e9cb116
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach ACK, some nits/questions
Tested, functional tests all pass
e9cb116
to
f4c588c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tACK f4c588c
Tested, functional tests all pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tested ACK f4c588c.
very cool! i like how the logic depends only on the node we're connecting to.
reproduced the intermittent failure using this patch and verified that the new logic fixes it!
self.nodes[0].add_p2p_connection(P2PInterface(), send_version=False, wait_for_verack=False)
self.connect_nodes(0, 2)
also didn't observe that much of a time increase (only around 70 s) when running all the functional tests in my local config (without wallets).
I'd say the tests should avoid racy code and deterministically execute the test code line-by-line, but enforcing this in |
The 'connect_nodes' function in the test framework relies on a stable number of peer connections to verify the new connection between the nodes is successfully established. This approach is fragile, as any of the peers involved in the process can drop, lose, or create a connection at any step, causing subsequent 'wait_until' checks to stall indefinitely even when the peers in question are connected successfully. This commit improves the situation by using the nodes' subversion and the connection direction (inbound/outbound) to identify the exact peer connection and perform the checks exclusively on it.
f4c588c
to
6629d1d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utACK 6629d1d
@@ -106,7 +106,7 @@ def __init__(self, i, datadir_path, *, chain, rpchost, timewait, timeout_factor, | |||
"-debugexclude=libevent", | |||
"-debugexclude=leveldb", | |||
"-debugexclude=rand", | |||
"-uacomment=testnode%d" % i, | |||
"-uacomment=testnode%d" % i, # required for subversion uniqueness across peers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"-uacomment=testnode%d" % i, # required for subversion uniqueness across peers | |
f"-uacomment=testnode{i}", # required for subversion uniqueness across peers |
style-nit, if you retouch
return peer['bytesrecv_per_msg'].pop(msg_type, 0) >= min_bytes_recv | ||
|
||
self.wait_until(lambda: check_bytesrecv(find_conn(from_connection, to_connection_subver, inbound=False), 'verack', 21)) | ||
self.wait_until(lambda: check_bytesrecv(find_conn(to_connection, from_connection_subver, inbound=True), 'verack', 21)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: As you remove the version
check, I wonder if this one can be removed as well for the same reason? The final pong
test should be necessary and sufficient already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: As you remove the
version
check, I wonder if this one can be removed as well for the same reason? The finalpong
test should be necessary and sufficient already.
I removed the version
check because verack
s are direct responses to version
messages at the protocol level. I'm not sure we should rely on the pong
alone because that might change over time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we should rely on the
pong
alone because that might change over time?
If it changes (for example the ping is sent before the verack), the test will already start to fail intermittently and will need to be adjusted anyway.
See the next line comment:
# The message bytes are counted before processing the message, so make
# sure it was fully processed by waiting for a ping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we should rely on the
pong
alone because that might change over time?If it changes (for example the ping is sent before the verack), the test will already start to fail intermittently and will need to be adjusted anyway.
I don't think thats applicable here. We cannot freely change the initial negotiation phase (the version-verack window) without a BIP for the p2p protocol change. At the protocol level, only features negotiation messages are allowed in this phase. Any other received message will be ignored and it is a reason for banning the other party.
See the next line comment:
# The message bytes are counted before processing the message, so make # sure it was fully processed by waiting for a ping.
What about including the flag indicating that the connection is ready (fSuccessfullyConnected
) in the getpeerinfo
response? It seems generally useful and would clean up all this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we should rely on the
pong
alone because that might change over time?If it changes (for example the ping is sent before the verack), the test will already start to fail intermittently and will need to be adjusted anyway.
I don't think thats applicable here. We cannot freely change the initial negotiation phase (the version-verack window) without a BIP for the p2p protocol change. At the protocol level, only features negotiation messages are allowed in this phase. Any other received message will be ignored and it is a reason for banning the other party.
Yes, that is what I am trying to say. The pong
alone should be necessary and sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is what I am trying to say. The pong alone should be necessary and sufficient.
Ah ok. "goto the first message".. now we are sync. Nice bikeshedding from my end.
So yeah, we could wait only for the pong
or introduce a new field on the getpeerinfo
output (fSuccessfullyConnected
with another name like "handshake_completed").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you'll have to tell maintainers if you want to address this nit, or leave it for a follow-up, otherwise they won't know whether it is fine to merge this from your side or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you'll have to tell maintainers if you want to address this nit, or leave it for a follow-up, otherwise they won't know whether it is fine to merge this from your side or not.
ok, yes. I was thinking on the second approach, but let's move forward. This will let me un-draft #27837. Happy to review any follow-up doing this.
Thanks Marko for the ping and this discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re-ACK 6629d1d
assert peer is not None, "Error: peer disconnected" | ||
return peer['bytesrecv_per_msg'].pop(msg_type, 0) >= min_bytes_recv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style nit: Mixture of '
and "
strings here and in a few places, though it was already mixed before -- "
seems to be the style in this file
reACK 6629d1d. |
ACK 6629d1d |
Decoupled from #27837 because this can help other too, found it investigating a CI failure https://cirrus-ci.com/task/5805115213348864?logs=ci#L3200.
The
connect_nodes
function in the test framework relies on a stable number of peerconnections to verify that the new connection between the nodes is successfully established.
This approach is fragile, as any of the peers involved in the process can drop, lose, or
create a connection at any step, causing subsequent
wait_until
checks to stall indefinitelyeven when the peers in question were connected successfully.
This commit improves the situation by using the nodes' subversion and the connection
direction (inbound/outbound) to identify the exact peer connection and perform the
checks exclusively on it.