
test: fix v2 transport intermittent test failure (#29002) #29006

Conversation

theStack
Contributor

@theStack theStack commented Dec 5, 2023

This PR improves the following fragile construct for detecting a new connection to the node under test in p2p_v2_transport.py:

num_peers = len(self.nodes[0].getpeerinfo())
s.connect(("127.0.0.1", p2p_port(0)))
self.wait_until(lambda: len(self.nodes[0].getpeerinfo()) == num_peers + 1)

Relying only on the number of peers suffers from race conditions, as unrelated previous peers could disconnect at any time in between. In the test run from #29002, the following happens:

  • getpeerinfo() is called the first time -> assigned to num_peers
  • a previous peer disconnects, the node's peer count is now num_peers - 1 (in most test runs, this disconnect happens before the first getpeerinfo call, which is why the failure is only intermittent)
  • the new peer connects, the node's peer count is now num_peers
  • the condition that the node's peer count equals num_peers + 1 never becomes true, and the test fails

Instead, use the more robust approach of watching for an increased highest peer id (again via the getpeerinfo RPC call), with a newly introduced context manager method TestNode.wait_for_new_peer(). Note that for the opposite case of a disconnect, no new method is introduced, as that case currently occurs only once in the test and is also simpler.
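For illustration, a minimal sketch of what such a TestNode context manager can look like (the polling loop and the get_highest_peer_id helper shown here are illustrative only; the actual PR uses the test framework's wait_until_helper_internal, as visible in a review snippet further below):

import contextlib
import time

@contextlib.contextmanager
def wait_for_new_peer(self, timeout=60):
    """Sketch: detect a new connection by watching for an increased highest peer id
    in the getpeerinfo RPC result, instead of counting peers."""
    def get_highest_peer_id():
        peer_info = self.getpeerinfo()
        return peer_info[-1]["id"] if peer_info else -1

    initial_peer_id = get_highest_peer_id()
    yield  # the caller opens the new connection inside the with-block
    # poll until a peer with a higher id than before shows up
    deadline = time.time() + timeout
    while get_highest_peer_id() <= initial_peer_id:
        assert time.time() <= deadline, "timed out waiting for a new peer"
        time.sleep(0.05)

Used from the test, the fragile three-line construct above then becomes:

with self.nodes[0].wait_for_new_peer():
    s.connect(("127.0.0.1", p2p_port(0)))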

Still happy to take suggestions for alternative solutions.

Fixes #29002.

Only relying on the number of peers for detecting a new connection
suffers from race conditions, as unrelated previous peers could
disconnect at anytime in-between. Use the more robust approach of
watching for an increased highest peer id instead (again using the
`getpeerinfo` RPC call), with a newly introduced context manager
method `TestNode.wait_for_new_peer()`.

Fixes bitcoin#29009.
@DrahtBot
Contributor

DrahtBot commented Dec 5, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type          Reviewers
ACK           stratospher, maflcko
Concept ACK   kevkevinpal

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

@DrahtBot DrahtBot added the Tests label Dec 5, 2023
@kevkevinpal
Contributor

Concept ACK 00e0658

Contributor

@stratospher stratospher left a comment


ACK 00e0658.

cross-checked node0's debug log to verify that this getpeerinfo gets called before the disconnection is complete.

                2023-12-05T13:26:56.774275Z [net] [net.cpp:1106] [ProcessReceivedKeyBytes] [net] V2 transport error: V1 peer with wrong MessageStart 0a03cf40
                2023-12-05T13:26:56.857216Z [net] [net.cpp:554] [CloseSocketDisconnect] [net] disconnecting peer=5
                2023-12-05T13:26:56.857322Z [http] [httpserver.cpp:306] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:32828
num_peers       2023-12-05T13:26:56.857568Z [httpworker.2] [rpc/request.cpp:187] [parse] [rpc] ThreadRPCServer method=getpeerinfo user=__cookie__
                2023-12-05T13:26:56.858580Z [net] [logging/timer.h:58] [Log] [lock] Enter: lock contention m_nodes_mutex, net.cpp:1883 started
                2023-12-05T13:26:56.862378Z [http] [httpserver.cpp:306] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:32828
                2023-12-05T13:26:56.862523Z [net] [logging/timer.h:58] [Log] [lock] Enter: lock contention m_nodes_mutex, net.cpp:1883 completed (3865μs)
num_peers-1     2023-12-05T13:26:56.862680Z [net] [net_processing.cpp:1651] [FinalizeNode] [net] Cleared nodestate for peer=5
num_peers-1     2023-12-05T13:26:56.863412Z [httpworker.3] [rpc/request.cpp:187] [parse] [rpc] ThreadRPCServer method=getpeerinfo user=__cookie__
num_peers-1+1   2023-12-05T13:26:56.863771Z [net] [net.cpp:3704] [CNode] [net] Added connection peer=6

a possible alternative i could think of is checking whether the ip address:port combo is present/not present in getpeerinfo. something like:
self.wait_until(lambda: "127.0.0.1:" + str(port) in [p["addrbind"] for p in self.nodes[0].getpeerinfo()])
but i think the current approach is slightly faster.

-    num_peers = len(self.nodes[0].getpeerinfo())
-    s.connect(("127.0.0.1", p2p_port(0)))
-    self.wait_until(lambda: len(self.nodes[0].getpeerinfo()) == num_peers + 1)
+    with self.nodes[0].wait_for_new_peer():
Contributor


00e0658: wondering if we need to account for previous disconnections here, since the unrelated disconnections happen only with peers where we directly open the sockets to the node, right? (the tests above use TestNode) i like how the approach is consistent with the rest of the test though.

Contributor Author

@theStack theStack Dec 8, 2023


You're right, in this case there shouldn't be any previous-disconnect issues, so the current connection-count approach would also work. I still decided to use wait_for_new_peer here for consistency reasons, and it's also less code and more readable.

@@ -144,22 +143,23 @@ def run_test(self):
         # Check wrong network prefix detection (hits if the next 12 bytes correspond to a v1 version message)
         wrong_network_magic_prefix = MAGIC_BYTES["signet"] + V1_PREFIX[4:]
         with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
-            s.connect(("127.0.0.1", p2p_port(0)))
+            with self.nodes[0].wait_for_new_peer():
+                s.connect(("127.0.0.1", p2p_port(0)))
             with self.nodes[0].assert_debug_log(["V2 transport error: V1 peer with wrong MessageStart"]):
                 s.sendall(wrong_network_magic_prefix + b"somepayload")
Member


nit: unrelated: It would be good to wait for CNode destruction after disconnection by checking the getpeerinfo RPC. This is what I meant in #29002 (comment)
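For example, something along these lines could be added after the disconnect (hypothetical snippet, not part of this PR; the peer id value is just for illustration):

disconnected_peer_id = 5  # id of the peer that is expected to go away (example value)
self.wait_until(lambda: disconnected_peer_id not in [p["id"] for p in self.nodes[0].getpeerinfo()])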

Member


Alternatively, it would be good to pass the timeout option to assert_debug_log, so that it is well-documented that the debug log is used to sync.
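For example (the concrete timeout value here is only illustrative):

with self.nodes[0].assert_debug_log(["V2 transport error: V1 peer with wrong MessageStart"], timeout=10):
    s.sendall(wrong_network_magic_prefix + b"somepayload")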


initial_peer_id = get_highest_peer_id()
yield
wait_until_helper_internal(lambda: get_highest_peer_id() > initial_peer_id,
Member


nit: unrelated: Could have a self.wait_until to avoid having to pass the timeout factor every time?
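A rough sketch of what such a TestNode convenience wrapper could look like (hypothetical; the exact helper signature is an assumption):

def wait_until(self, test_function, timeout=60):
    # forwards to the shared helper and supplies this node's timeout factor automatically
    return wait_until_helper_internal(test_function, timeout=timeout, timeout_factor=self.timeout_factor)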

Member

@maflcko maflcko left a comment


lgtm. Can't hurt to have this helper, because it also makes the code less verbose.

@theStack
Contributor Author

theStack commented Dec 8, 2023

Thanks for the reviews! I'm leaving the unrelated nits (#29006 (comment) and #29006 (comment)) for a follow-up.

@maflcko
Member

maflcko commented Dec 8, 2023

Ok, lgtm ACK 00e0658

@fanquake fanquake merged commit 03042fb into bitcoin:master Dec 8, 2023
16 checks passed
@theStack theStack deleted the 202312-test-fix_v2_transport_intermittent_failure branch December 8, 2023 11:34
@fanquake
Member

fanquake commented Dec 8, 2023

Note there was a typo in the commit message. This fixed #29002, not #29009 (which accidentally got closed).

fanquake added a commit that referenced this pull request Dec 11, 2023
ea00f98 test: fix intermittent error in rpc_net.py (#29030) (Sebastian Falbesoner)

Pull request description:

  Asserting for the debug log message "Added connection peer=" is insufficient for ensuring that this new connection will show up in a following getpeerinfo() call, as the debug message is written in the CNode ctor, which means it hasn't necessarily been added to CConnman.m_nodes at this point.

  Solve this by using the recently introduced `wait_for_new_peer` helper (see #29006,  commit 00e0658), which is more robust.

  Fixes #29030.

ACKs for top commit:
  maflcko:
    lgtm ACK ea00f98

Tree-SHA512: dda307949a466fb3b24408a8c213d307e0af2155f2e8b4e52c836a22397f9d218bf9d8c54ca55bae62a96d7566f27167db9311dd8801785c327234783af5ed00
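For illustration, a hedged sketch of the kind of change that follow-up describes (the exact rpc_net.py context is assumed here, not quoted from the commit):

# before (fragile): syncs on a debug log line that is written in the CNode ctor,
# i.e. possibly before the peer is visible via getpeerinfo
with self.nodes[0].assert_debug_log(["Added connection peer="]):
    s.connect(("127.0.0.1", p2p_port(0)))

# after (robust): waits until getpeerinfo actually reports the new peer
with self.nodes[0].wait_for_new_peer():
    s.connect(("127.0.0.1", p2p_port(0)))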

Successfully merging this pull request may close these issues.

Intermittent test failure in p2p_v2_transport