Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix zmq test flakiness #20934

Closed
maflcko opened this issue Jan 14, 2021 · 10 comments · Fixed by #21008
Closed

Fix zmq test flakiness #20934

maflcko opened this issue Jan 14, 2021 · 10 comments · Fixed by #21008

Comments

@maflcko
Copy link
Member

maflcko commented Jan 14, 2021

There are many reports of the test being flaky: #20672 (comment)

Thus, it should be made more robust, as described in #20538 (comment)

Useful skills:

  • Background in our functional test suite (python3)
  • Background in zmq

Want to work on this issue?

For guidance on contributing, please read CONTRIBUTING.md before opening your pull request.

@adamjonas
Copy link
Member

Of the last 571 failures, 22 are from the interface_zmq.py functional tests (3.8%). According to the numbers, it's the flakiest functional tests we have.

@domob1812 @theStack @mruddy @n-thumann are any of you willing to give this a shot?

@theStack
Copy link
Contributor

Took some time to look at the problem, it seems to be quite tricky to solve in a solid way. I tried the suggested method of "syncing up" via repeatedly generating a block and waiting for the expected message (until it doesn't timeout anymore), but generating a block seems to interfere with some of the sub-tests. It also already generates notification messages for our subs that are received later (even if we are not connected yet). Maybe something like this would work:

  • restart node with additional pubhashtx test publisher (on a port not used by any of the test subs)
  • repeatedly generate block and wait for expected messages from test publisher, until it doesn't time out anymore
  • invalidate generated blocks
  • clear mempool (needed?)
  • read from our subscriber sockets until there is no data (a "reverse flush" so to say)

Maybe I'm thinking too complicated though.
Whatever the solution will be, at least having a common test setup method should serve as a better basis for solving this issue: #20953

@instagibbs
Copy link
Member

but generating a block seems to interfere with some of the sub-tests

Yes it would require making all the subtests more robust I think.

alternative setup

Seems pretty complicated, and with intentional block rollbacks things can get weird.

fanquake added a commit that referenced this issue Jan 21, 2021
…ubscription)

4efb6c2 zmq test: deduplicate test setup code (node restart, topics subscription) (Sebastian Falbesoner)

Pull request description:

  This PR deduplicates common setup code for the ZMQ functional test. The following steps, previously duplicated in each sub-test, are put into a new method `setup_zmq_test(...)`:
  - create subscriber sockets (`zmq.SUB`) for each topic with the specified timeout (default 60s)
  - restart node0 with specified zmq notifications enabled (`-zmqpub...=tcp://127.0.0.1:...`...)
  - if desired, connect node0 with node1 (note done by default)
  - connect all susbcriber sockets to publisher (running on node0)
  - wait a bit (currently 200ms), to _"Relax so that the subscribers are ready before publishing zmq messages"_

  Note that the last point should be repaced by a more robust method, as this test is still flaky, see #20934 (also #20590 and #20538).

ACKs for top commit:
  instagibbs:
    ACK 4efb6c2
  laanwj:
    Code review ACK 4efb6c2

Tree-SHA512: d49626756a9c669f1133f1b73ce273994b58c760ce0d6a4bdaa384f043a74149dc2b9fa66fe990413d9105f9c3b6ea973e099669e8e02f2902a5b84fa995028c
sidhujag pushed a commit to syscoin/syscoin that referenced this issue Jan 21, 2021
…opics subscription)

4efb6c2 zmq test: deduplicate test setup code (node restart, topics subscription) (Sebastian Falbesoner)

Pull request description:

  This PR deduplicates common setup code for the ZMQ functional test. The following steps, previously duplicated in each sub-test, are put into a new method `setup_zmq_test(...)`:
  - create subscriber sockets (`zmq.SUB`) for each topic with the specified timeout (default 60s)
  - restart node0 with specified zmq notifications enabled (`-zmqpub...=tcp://127.0.0.1:...`...)
  - if desired, connect node0 with node1 (note done by default)
  - connect all susbcriber sockets to publisher (running on node0)
  - wait a bit (currently 200ms), to _"Relax so that the subscribers are ready before publishing zmq messages"_

  Note that the last point should be repaced by a more robust method, as this test is still flaky, see bitcoin#20934 (also bitcoin#20590 and bitcoin#20538).

ACKs for top commit:
  instagibbs:
    ACK 4efb6c2
  laanwj:
    Code review ACK bitcoin@4efb6c2

Tree-SHA512: d49626756a9c669f1133f1b73ce273994b58c760ce0d6a4bdaa384f043a74149dc2b9fa66fe990413d9105f9c3b6ea973e099669e8e02f2902a5b84fa995028c
@maflcko
Copy link
Member Author

maflcko commented Jan 22, 2021

Could a mempool tx be used to sync up instead of a block?

@practicalswift
Copy link
Contributor

practicalswift commented Jan 26, 2021

What about temporarily disabling interface_zmq.py in CI until this is fixed?

It seems to me that interface_zmq.py as it is currently working is a net negative from a CI testing perspective due to its extreme flakiness :)

@instagibbs
Copy link
Member

instagibbs commented Jan 26, 2021 via email

@maflcko
Copy link
Member Author

maflcko commented Jan 26, 2021

Of the last 571 failures, 22 are from the interface_zmq.py functional tests (3.8%). According to the numbers, it's the flakiest functional tests we have.

(quote from @adamjonas )

sidhujag pushed a commit to syscoin/syscoin that referenced this issue Feb 16, 2021
ef21fb7 zmq test: speedup test by whitelisting peers (immediate tx relay) (Sebastian Falbesoner)
5c65463 zmq test: fix flakiness by using more robust sync method (Sebastian Falbesoner)
8666033 zmq test: accept arbitrary sequence start number in ZMQSubscriber (Sebastian Falbesoner)
6014d6e zmq test: dedup message reception handling in ZMQSubscriber (Sebastian Falbesoner)

Pull request description:

  Fixes bitcoin#20934 by using the "sync up" method described in bitcoin#20538 (comment).

  After improving robustness with this approach (commits 1-3), it turned out that there were still some fails, but those were unrelated to zmq: Out of 500 runs, 3 times `sync_mempool()` or `sync_blocks()` timed out, which can happen because the trickle relay time has no upper bound -- hence in rare cases, it takes longer than 60s. This is fixed by enabling immediate tx relay on node1 (commit 4), which as a nice side-effect also gives us a rough 2x speedup for the test.

  For further details, also see the explanations in the commit messages.

  There is no guarantee that the test is still not flaky, but it would help if potential reviewers would run the following script locally and report how many runs failed (feel free to do less than 1000 runs, as this takes quite a long if ran with `--valgrind`):
  ```
  #!/bin/sh
  OUTPUT_FILE=./zmq_results
  echo ===== repeated zmq test ===== > $OUTPUT_FILE

  for i in `seq 1000`; do
      echo ------------------------
      echo ----- test run $i -----
      echo ------------------------
      echo --- $i --- >> $OUTPUT_FILE
      ./test/functional/interface_zmq.py --valgrind
      if [ $? -ne 0 ]; then
          echo "FAILED. /o\\" >> $OUTPUT_FILE
      else
          echo "PASSED. \\o/" >> $OUTPUT_FILE
      fi
  done

  echo Failed test runs:
  grep FAILED $OUTPUT_FILE | wc -l
  ```

ACKs for top commit:
  jonatack:
    Light ACK ef21fb7 with the caveat that I was unable to make the test fail with valgrind both here and on master, so I can't vouch that it actually fixes the CI flakiness. The test does run ~2x faster with this.

Tree-SHA512: 7a1e7592fbbd98e69e1e1294486b91253e589c72b3c6bbb7f587028ec07cca59b7d984e4ebf256c4bc3e8a529ec77d31842f3dd874038aea0b684abfea50306a
@adamjonas adamjonas reopened this Mar 1, 2021
@adamjonas
Copy link
Member

interface_zmq.py flakiness is back and I think #21008 is hurting more than helping.

Before merge of #21008 on 2/16 (Feb 12-15):
Failed 1 time on 1 PR (1,274 bullds)

Same Friday to Monday time period after merge (Feb 19-22):
Failed 11 times across different 9 PRs (1,470 total builds)

@maflcko maflcko closed this as completed Mar 2, 2021
@maflcko
Copy link
Member Author

maflcko commented Mar 2, 2021

Fixed in #21216 ?

@adamjonas
Copy link
Member

ref #21310

Fabcien pushed a commit to Bitcoin-ABC/bitcoin-abc that referenced this issue Nov 30, 2021
Summary:
PR description:
> Fixes [[bitcoin/bitcoin#20934 | core#20934]] by using the "sync up" method described in [[bitcoin/bitcoin#20538 | core#20538]].
>
> After improving robustness with this approach (commits 1-3), it turned out that there were still some fails, but those were unrelated to zmq: Out of 500 runs, 3 times sync_mempool() or sync_blocks() timed out, which can happen because the trickle relay time has no upper bound -- hence in rare cases, it takes longer than 60s. This is fixed by enabling immediate tx relay on node1 (commit 4), which as a nice side-effect also gives us a rough 2x speedup for the test.
>
> For further details, also see the explanations in the commit messages.

Note that the changes in commit 4, the speedup from 45s to 14s runtime  using `noban` permissions, were already applied in Bitcoin ABC in D10311.

This is a backport of [[bitcoin/bitcoin#21008 | core#21008]] [1/3]
bitcoin/bitcoin@6014d6e

Depends on D10588

Test Plan: `ninja check-functional`

Reviewers: #bitcoin_abc, Fabien

Reviewed By: #bitcoin_abc, Fabien

Differential Revision: https://reviews.bitcoinabc.org/D10589
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Aug 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants
@theStack @adamjonas @instagibbs @maflcko @practicalswift and others