intermittent issue in interface_zmq "assert_equal(struct.unpack('<I', seq)[-1], self.sequence)" #20538

maflcko · 2020-12-01T08:13:16Z

Traceback (most recent call last):
  File "./test/functional/test_framework/test_framework.py", line 126, in main
    self.run_test()
  File "./test/functional/interface_zmq.py", line 73, in run_test
    self.test_basic()
  File "./test/functional/interface_zmq.py", line 119, in test_basic
    txid = hashtx.receive()
  File "./test/functional/interface_zmq.py", line 41, in receive
    assert_equal(struct.unpack('<I', seq)[-1], self.sequence)
  File "./test/functional/test_framework/util.py", line 50, in assert_equal
    raise AssertionError("not(%s)" % " == ".join(str(arg) for arg in (thing1, thing2) + args))
AssertionError: not(1 == 0)


maflcko · 2020-12-01T08:13:45Z

I've seen this multiple times

cc @instagibbs

instagibbs · 2020-12-01T10:06:33Z

Which notifications are you listening to?


instagibbs · 2020-12-01T12:46:16Z

Ah sorry, email notification confusion.

I'll take a look and see if I can reason about it.

instagibbs · 2020-12-02T12:32:23Z

Nothing fishy in the logs.

It's possible that this chunk of code isn't lenient enough under certain conditions, causing the first message to get lost:

        # Relax so that the subscriber is ready before publishing zmq messages
        sleep(0.2)
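For context, the failing assertion decodes the last frame of each notification as a 4-byte little-endian counter (`'<I'`). The sketch below is only an illustration of how a dropped first message would produce `not(1 == 0)`. The helper name `decode_sequence` and the hand-built frame list are assumptions, not code from `interface_zmq.py`.

```python
import struct


def decode_sequence(seq_bytes):
    """Decode a 4-byte little-endian unsigned sequence number."""
    return struct.unpack("<I", seq_bytes)[-1]


# Hypothetical sequence frames as a publisher would emit them: 0, 1, 2.
published = [struct.pack("<I", n) for n in range(3)]

# Assume the subscriber was not ready yet, so the first frame was dropped.
received = published[1:]

expected = 0  # the test starts counting at zero
first_seen = decode_sequence(received[0])

# This mirrors the failing comparison: first_seen is 1, expected is 0.
print(first_seen == expected)
```

If this reading is right, the assertion fails because the subscriber joined late, not because the node published a wrong counter.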

maflcko · 2020-12-04T13:42:10Z

Removing the sleep gives zmq.error.Again: Resource temporarily unavailable, which I've also seen in some test failures. Generally, a hardcoded sleep is a recipe for intermittent failures.

Is it not possible for zmq to fully boot up (like the rpc interface) when the server is starting?

How is this dealt with in production?

instagibbs · 2020-12-04T13:44:25Z

In prod you don't "expect" any specific message so you start with whichever one you get. I think for tests we could just have a block making and zmq consumption loop that "syncs up" the expected message then continues onto the rest of the test?


instagibbs · 2020-12-04T13:45:39Z

E.g., make a block, then try to read zmq messages (with a timeout) until you get it. If that times out, make another block and try again. Rinse and repeat until you get the message.
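A rough sketch of that "sync up" loop, using only the standard library so it runs without a node or pyzmq. Everything here is hypothetical: `FakePublisher` stands in for the node's publisher and simply drops the first few messages to mimic a subscriber that is not ready yet, and `sync_up` is an illustrative name, not a function from the test framework.

```python
import queue
import struct


class FakePublisher:
    """Stand-in for the node's ZMQ publisher (illustration only).

    The first `drop_first_n` messages are lost, as if the subscriber
    had not finished connecting when they were published.
    """

    def __init__(self, drop_first_n):
        self.sequence = 0
        self.drop_remaining = drop_first_n
        self.inbox = queue.Queue()  # what the subscriber can read

    def publish_block(self):
        msg = (b"hashblock", b"\x00" * 32, struct.pack("<I", self.sequence))
        self.sequence += 1
        if self.drop_remaining > 0:
            self.drop_remaining -= 1  # subscriber not ready: message lost
            return
        self.inbox.put(msg)


def sync_up(publisher, timeout=0.05, max_attempts=20):
    """Make blocks until one notification arrives.

    Returns the sequence number the subscriber should expect next.
    """
    for _ in range(max_attempts):
        publisher.publish_block()
        try:
            _topic, _body, seq = publisher.inbox.get(timeout=timeout)
        except queue.Empty:
            continue  # timed out: make another block and retry
        return struct.unpack("<I", seq)[-1] + 1
    raise TimeoutError("subscriber never received a message")


pub = FakePublisher(drop_first_n=3)
next_seq = sync_up(pub)
print(next_seq)
```

The point of the design is that the test no longer assumes the first published message is received. It only assumes that some message eventually is, and it takes the expected sequence number from that message.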


…ubscription)

4efb6c2 zmq test: deduplicate test setup code (node restart, topics subscription) (Sebastian Falbesoner)

Pull request description:

This PR deduplicates common setup code for the ZMQ functional test. The following steps, previously duplicated in each sub-test, are put into a new method `setup_zmq_test(...)`:

- create subscriber sockets (`zmq.SUB`) for each topic with the specified timeout (default 60s)
- restart node0 with specified zmq notifications enabled (`-zmqpub...=tcp://127.0.0.1:...`...)
- if desired, connect node0 with node1 (not done by default)
- connect all subscriber sockets to publisher (running on node0)
- wait a bit (currently 200ms), to _"Relax so that the subscribers are ready before publishing zmq messages"_

Note that the last point should be replaced by a more robust method, as this test is still flaky, see #20934 (also #20590 and #20538).

ACKs for top commit:

instagibbs: ACK 4efb6c2
laanwj: Code review ACK 4efb6c2

Tree-SHA512: d49626756a9c669f1133f1b73ce273994b58c760ce0d6a4bdaa384f043a74149dc2b9fa66fe990413d9105f9c3b6ea973e099669e8e02f2902a5b84fa995028c


ef21fb7 zmq test: speedup test by whitelisting peers (immediate tx relay) (Sebastian Falbesoner)
5c65463 zmq test: fix flakiness by using more robust sync method (Sebastian Falbesoner)
8666033 zmq test: accept arbitrary sequence start number in ZMQSubscriber (Sebastian Falbesoner)
6014d6e zmq test: dedup message reception handling in ZMQSubscriber (Sebastian Falbesoner)

Pull request description:

Fixes #20934 by using the "sync up" method described in bitcoin/bitcoin#20538 (comment).

After improving robustness with this approach (commits 1-3), it turned out that there were still some failures, but those were unrelated to zmq: out of 500 runs, `sync_mempool()` or `sync_blocks()` timed out 3 times, which can happen because the trickle relay time has no upper bound, so in rare cases it takes longer than 60s. This is fixed by enabling immediate tx relay on node1 (commit 4), which as a nice side effect also gives us a rough 2x speedup for the test.

For further details, also see the explanations in the commit messages.

There is no guarantee that the test is no longer flaky, but it would help if potential reviewers would run the following script locally and report how many runs failed (feel free to do fewer than 1000 runs, as this takes quite a long time when run with `--valgrind`):

```
#!/bin/sh
OUTPUT_FILE=./zmq_results
echo ===== repeated zmq test ===== > $OUTPUT_FILE
for i in `seq 1000`; do
    echo ------------------------
    echo ----- test run $i -----
    echo ------------------------
    echo --- $i --- >> $OUTPUT_FILE
    ./test/functional/interface_zmq.py --valgrind
    if [ $? -ne 0 ]; then
        echo "FAILED. /o\\" >> $OUTPUT_FILE
    else
        echo "PASSED. \\o/" >> $OUTPUT_FILE
    fi
done
echo Failed test runs:
grep FAILED $OUTPUT_FILE | wc -l
```

ACKs for top commit:

jonatack: Light ACK ef21fb7 with the caveat that I was unable to make the test fail with valgrind both here and on master, so I can't vouch that it actually fixes the CI flakiness. The test does run ~2x faster with this.
Tree-SHA512: 7a1e7592fbbd98e69e1e1294486b91253e589c72b3c6bbb7f587028ec07cca59b7d984e4ebf256c4bc3e8a529ec77d31842f3dd874038aea0b684abfea50306a
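Commit 8666033 above is described as accepting an arbitrary sequence start number. One way such a check could look is sketched below. This is an assumption about the general idea, not the actual `ZMQSubscriber` code, and `SequenceTracker` is a hypothetical name.

```python
import struct


class SequenceTracker:
    """Hypothetical helper: accept any first sequence number, then
    require each following one to be exactly one higher."""

    def __init__(self):
        self.expected = None  # unknown until the first message arrives

    def check(self, seq_bytes):
        seq = struct.unpack("<I", seq_bytes)[-1]
        if self.expected is not None and seq != self.expected:
            raise AssertionError(
                f"expected sequence {self.expected}, got {seq}")
        self.expected = seq + 1
        return seq


tracker = SequenceTracker()
# The subscriber joined late, so the first sequence it sees is 5.
seen = [tracker.check(struct.pack("<I", n)) for n in (5, 6, 7)]
print(seen)
```

A gap after the first message still raises, so lost messages in the middle of the stream are detected as before.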


Summary: PR description:

> Fixes [[bitcoin/bitcoin#20934 | core#20934]] by using the "sync up" method described in [[bitcoin/bitcoin#20538 | core#20538]].
>
> After improving robustness with this approach (commits 1-3), it turned out that there were still some fails, but those were unrelated to zmq: Out of 500 runs, 3 times sync_mempool() or sync_blocks() timed out, which can happen because the trickle relay time has no upper bound -- hence in rare cases, it takes longer than 60s. This is fixed by enabling immediate tx relay on node1 (commit 4), which as a nice side-effect also gives us a rough 2x speedup for the test.
>
> For further details, also see the explanations in the commit messages.

Note that the changes in commit 4, the speedup from 45s to 14s runtime using `noban` permissions, were already applied in Bitcoin ABC in D10311.

This is a backport of [[bitcoin/bitcoin#21008 | core#21008]] [1/3]
bitcoin/bitcoin@6014d6e

Depends on D10588

Test Plan: `ninja check-functional`

Reviewers: #bitcoin_abc, Fabien

Reviewed By: #bitcoin_abc, Fabien

Differential Revision: https://reviews.bitcoinabc.org/D10589

maflcko added Bug Tests labels Dec 1, 2020

maflcko mentioned this issue Dec 7, 2020

qa: Intermittent failure in interface_zmq.py "Resource temporarily unavailable" #20590

Closed

This was referenced Dec 16, 2020

Intermittent CI failure in interface_zmq: "Unexpected exception caught during testing" (zmq.error.Again)" #20672

Closed

qa: Relax so that the subscriber is ready before publishing zmq messages #16740

Merged

maflcko mentioned this issue Jan 14, 2021

Fix zmq test flakiness #20934

Closed

theStack mentioned this issue Jan 17, 2021

test: dedup zmq test setup code (node restart, topics subscription) #20953

Merged

theStack mentioned this issue Jan 26, 2021

test: fix zmq test flakiness, improve speed #21008

Merged

maflcko closed this as completed Mar 8, 2021

bitcoin locked as resolved and limited conversation to collaborators Aug 18, 2022
