
Fix CI #3064

Closed

wants to merge 3 commits into from

Conversation

@Hocuri (Collaborator) commented Feb 7, 2022

#skip-changelog

The tests are flaky (sometimes they just run through, though; I repeatedly hit re-run: the first two runs succeeded, and in the third run six tests failed at once):

https://pipelines.actions.githubusercontent.com/0Cs9EsF2mwtr7icbiRvpCNrfoVF185ozSjOeo0Ale4nDoOrShu/_apis/pipelines/1/runs/8174/signedlogcontent/9?urlExpires=2022-02-07T20%3A40%3A06.1888390Z&urlSigningMethod=HMACV1&urlSignature=%2FO5JxNmTDQpKnBGVfL8wDKg7TVaN1ypDfmhQdepPyw4%3D

2022-02-07T18:52:10.6629468Z =================================== FAILURES ===================================
2022-02-07T18:52:10.6629891Z ____________________________ tests/test_account.py _____________________________
2022-02-07T18:52:10.6630709Z [gw3] linux -- Python 3.9.10 /home/runner/work/deltachat-core-rust/deltachat-core-rust/python/.tox/py3/bin/python
2022-02-07T18:52:10.6631453Z worker 'gw3' crashed while running 'tests/test_account.py::TestOnlineAccount::test_qr_join_chat'
2022-02-07T18:52:10.6632137Z ____________________________ tests/test_account.py _____________________________
2022-02-07T18:52:10.6632975Z [gw1] linux -- Python 3.9.10 /home/runner/work/deltachat-core-rust/deltachat-core-rust/python/.tox/py3/bin/python
2022-02-07T18:52:10.6633914Z worker 'gw1' crashed while running 'tests/test_account.py::TestOnlineAccount::test_immediate_autodelete'
2022-02-07T18:52:10.6634430Z ____________________________ tests/test_account.py _____________________________
2022-02-07T18:52:10.6635222Z [gw2] linux -- Python 3.9.10 /home/runner/work/deltachat-core-rust/deltachat-core-rust/python/.tox/py3/bin/python
2022-02-07T18:52:10.6635938Z worker 'gw2' crashed while running 'tests/test_account.py::TestOnlineAccount::test_set_get_contact_avatar'
2022-02-07T18:52:10.6636604Z ____________________________ tests/test_account.py _____________________________
2022-02-07T18:52:10.6637196Z [gw4] linux -- Python 3.9.10 /home/runner/work/deltachat-core-rust/deltachat-core-rust/python/.tox/py3/bin/python
2022-02-07T18:52:10.6638069Z worker 'gw4' crashed while running 'tests/test_account.py::TestOnlineAccount::test_group_quote'
2022-02-07T18:52:10.6638554Z ____________________________ tests/test_account.py _____________________________
2022-02-07T18:52:10.6639338Z [gw5] linux -- Python 3.9.10 /home/runner/work/deltachat-core-rust/deltachat-core-rust/python/.tox/py3/bin/python
2022-02-07T18:52:10.6640290Z worker 'gw5' crashed while running 'tests/test_account.py::TestOnlineAccount::test_name_changes'
2022-02-07T18:52:10.6640948Z ____________________________ tests/test_account.py _____________________________
2022-02-07T18:52:10.6641561Z [gw6] linux -- Python 3.9.10 /home/runner/work/deltachat-core-rust/deltachat-core-rust/python/.tox/py3/bin/python
2022-02-07T18:52:10.6642430Z worker 'gw6' crashed while running 'tests/test_account.py::TestOnlineAccount::test_fetch_existing[True]'
2022-02-07T18:52:10.6642919Z ____________________________ tests/test_account.py _____________________________
2022-02-07T18:52:10.6643704Z [gw0] linux -- Python 3.9.10 /home/runner/work/deltachat-core-rust/deltachat-core-rust/python/.tox/py3/bin/python
2022-02-07T18:52:10.6644454Z worker 'gw0' crashed while running 'tests/test_account.py::TestGroupStressTests::test_synchronize_member_list_on_group_rejoin'

@Hocuri (Collaborator, Author) commented Feb 7, 2022

I think I know at least a big part of the cause.

Consider these flaky tests:

    import time
    from random import random

    class TestFlaky:
        def test_flaky_timeout(self):
            # hangs in ~50% of runs until pytest-timeout kills it
            if random() < 0.5:
                time.sleep(20000)

        def test_flaky_error(self):
            # fails with an ordinary exception in ~50% of runs
            if random() < 0.5:
                raise Exception()

If you run with -n2 --reruns 2 --reruns-delay 5 -v -rsXx --ignored --strict-tls -k test_flaky (which is similar to the CI), test_flaky_error() is rerun when it fails, but test_flaky_timeout() is not: for some reason, rerunning doesn't work for timeouts.

Now, the thing is: before #3059, we had arbitrary timeouts that raised an error after roughly 30 seconds, which effectively turned timeouts into errors. So a quick and easy solution would be to revert #3059.
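
For illustration, here is a minimal sketch of such a manual timeout (not the actual helper that #3059 removed; the name wait_for, its arguments, and the 30-second default are made up). Polling with a deadline and then raising turns a hang into an ordinary, rerunnable failure instead of a hard timeout:

    import time

    def wait_for(condition, timeout=30, poll=0.5):
        """Poll condition() until it returns a truthy value or timeout seconds pass."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if condition():
                return
            time.sleep(poll)
        # Raising here makes the test fail like any other failure, which
        # pytest-rerunfailures can retry; a plain hang would instead run
        # into pytest-timeout without a rerun.
        raise TimeoutError("condition not met within {} seconds".format(timeout))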

IIRC the tests were already flaky and failing sometimes before - plus, in an ideal world, it wouldn't be necessary to re-run them at all. But we should probably try to get re-running to work first.

As for why tests are failing in the first place... from the log, it seems like there are a lot of different causes:

  • A message can't be sent in test_set_get_contact_avatar() because of src/smtp.rs:479: Failed to send message over SMTP: SMTP connection failure: SMTP failed to connect: permanent: 5.7.8 Error: authentication failed: .
  • In test_qr_join_chat(), ac2 says that it sends one of the securejoin messages, but then it never arrives at ac1. Either ac2 didn't properly send it, or it got lost somewhere, or ac1 didn't notice it.
  • In test_immediate_autodelete(), imap2.idle_check() doesn't notice the message that was sent. ac2 does receive the message according to the logs, so it seems like the message arrived but idle_check() didn't work.
  • test_group_quote(): Excerpt from the log:
2022-02-07T18:49:04.8011781Z 59.98 [events-ac2] DC_EVENT_INFO data1=0 data2=src/imap/idle.rs:73: Idle has NewData ResponseData { raw: 4096, response: MailboxData(Exists(1)) }
2022-02-07T18:49:04.8012329Z 60.20 [events-ac2] DC_EVENT_INFO data1=0 data2=src/job.rs:753: loading job for Imap-thread
2022-02-07T18:49:04.8012825Z 60.43 [events-ac2] DC_EVENT_INFO data1=0 data2=src/imap.rs:823: 1 mails read from "INBOX".
2022-02-07T18:49:04.8013306Z 61.05 [events-ac2] DC_EVENT_IMAP_MESSAGE_MOVED data1=0 data2=IMAP messages 15 moved to DeltaChat
2022-02-07T18:49:04.8013827Z 61.05 [events-ac2] DC_EVENT_INFO data1=0 data2=src/imap/scan_folders.rs:28: Starting full folder scan
2022-02-07T18:49:04.8014321Z 61.46 [events-ac2] DC_EVENT_INFO data1=0 data2=src/imap.rs:823: 0 mails read from "INBOX".
2022-02-07T18:49:04.8014815Z 61.69 [events-ac2] DC_EVENT_CONNECTIVITY_CHANGED data1=0 data2=0
2022-02-07T18:49:04.8015318Z 61.69 [events-ac2] DC_EVENT_INFO data1=0 data2=src/imap.rs:1624: ignoring unsolicited response Recent(1)
2022-02-07T18:49:04.8016127Z 61.69 [events-ac2] DC_EVENT_INFO data1=0 data2=src/imap.rs:1624: ignoring unsolicited response Other(ResponseData { raw: 4096, response: Data { status: Ok, code: Some(CopyUid(1644259495, [Uid(15)], [Uid(4)])), information: Some("Moved UIDs.") } })

Seems like the mail is found on the server and then moved to the DeltaChat folder - this happens because get_many_online_accounts() has move=True by default (while the more frequently used get_online_configuring_account() has move=False; maybe it's good that different settings are used for the different tests, see the sketch after this list). But then the mail is somehow not found in the DeltaChat folder.

@link2xt do you know if maybe the server doesn't send an Exists when idle-ing on the destination server if it already sends a CopyUid?

  • There are about 4 more timeouts in the log which I didn't analyze.
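
On the move=True/move=False difference mentioned above, here is a minimal sketch (not the actual test fixture) of configuring a single account so that messages stay in INBOX. It assumes the deltachat Python bindings (Account, set_config) and the mvbox_move config key; the helper name and its arguments are made up for illustration:

    import deltachat

    def make_account_without_move(db_path, addr, password):
        # Hypothetical helper: create an account with move-to-DeltaChat-folder
        # disabled, so a test watching INBOX sees the message there.
        account = deltachat.Account(db_path)
        account.set_config("addr", addr)
        account.set_config("mail_pw", password)
        account.set_config("mvbox_move", "0")  # "1" (the default) enables moving
        return account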

@link2xt (Collaborator) commented Feb 8, 2022

There is a related issue, pytest-dev/pytest-rerunfailures#99, suggesting the -o timeout_func_only=true option.

Another possibly related issue: pytest-dev/pytest-rerunfailures#157

@link2xt (Collaborator) commented Feb 8, 2022

maybe the server doesn't send an Exists when idle-ing on the destination server if it already sends a CopyUid?

CopyUid is sent on the Inbox connection, but the Movebox connection should still be interrupted with an Exists, because the message could even have been moved by a client on another device.

@Hocuri (Collaborator, Author) commented Feb 8, 2022

suggesting -o timeout_func_only=true option

I just tried this out (my code from above with -n2 --reruns 2 --reruns-delay 5 -v -rsXx --ignored --strict-tls -k test_flaky -o timeout_func_only=true), but it didn't help.

@link2xt (Collaborator) commented Feb 8, 2022

Works for me with

-timeout_method = thread
+timeout_func_only = True

in tox.ini.

timeout_method has been set to thread since the introduction of rerunfailures and is probably not needed: ffc525a
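
As a way to check this locally, here is a hedged minimal reproduction (the file name, the 5-second timeout, and the flag values are made up for illustration; pytest.mark.timeout comes from pytest-timeout, --reruns from pytest-rerunfailures, and -n from pytest-xdist):

    # test_timeout_rerun_repro.py
    import time

    import pytest

    @pytest.mark.timeout(5)
    def test_always_times_out():
        # always exceeds the 5-second timeout
        time.sleep(60)

Running e.g. pytest -n2 --reruns 2 -v test_timeout_rerun_repro.py should show RERUN entries before the final FAILED if rerunning of timed-out tests works; if the xdist worker is killed instead, no reruns happen.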

Hocuri changed the title from "[WIP] Try to fix CI..." to "Fix CI" on Feb 8, 2022
@Hocuri (Collaborator, Author) commented Feb 8, 2022

Nice, thanks, works for me too!

@link2xt (Collaborator) commented Feb 8, 2022

I pushed it as 6c6d47c with some description in the commit message.

link2xt closed this Feb 8, 2022
link2xt deleted the hocuri/try-to-fix-ci branch February 8, 2022 20:49
@link2xt (Collaborator) commented Feb 8, 2022

Hopefully this is even more robust than before, because it will rerun all timeout-related failures rather than only our own manual timeouts.

@link2xt (Collaborator) commented Feb 8, 2022

We are still affected by pytest-dev/pytest-rerunfailures#157, though: when a worker crashes, it is replaced, but the replacement forgets to do the reruns.
