Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate flaky tests #1139

Open
sssoleileraaa opened this issue Aug 12, 2020 · 26 comments · Fixed by #1249
Open

Investigate flaky tests #1139

sssoleileraaa opened this issue Aug 12, 2020 · 26 comments · Fixed by #1249

Comments

@sssoleileraaa
Copy link
Contributor

sssoleileraaa commented Aug 12, 2020

Description

I've seen the following tests in the client behave flakily :):

  • test_PrintDialog__show_insert_usb_message (see circleci)
        def stop_animate_header(self):
    >       self.header_icon.setVisible(True)
    E       RuntimeError: wrapped C/C++ object of type SvgLabel has been deleted
  • test_ExportDialog__export_file (see circleci)
            # Connect parent signals to slots
    >       self.continue_button.setEnabled(False)
    E       RuntimeError: wrapped C/C++ object of type QPushButton has been deleted
  • test_PrintDialog__print_file (see circleci)
            # Connect parent signals to slots
    >       self.continue_button.setEnabled(False)
    E       RuntimeError: wrapped C/C++ object of type QPushButton has been deleted
  • test_ModalDialog_animation_of_activestate (see circle)
    >       self.error_details.setStyleSheet("")
    E       RuntimeError: wrapped C/C++ object of type QLabel has been deleted

Further investigation needs to be done around why the dialog tests appear to be flaky - could have to do with needing make test assertions before signal handlers are initiated, e.g. the close signal handler will close the dialog so making an assertion on the dialog after it has been closed could cause an issue.

STR

See #1139 (comment)

@sssoleileraaa
Copy link
Contributor Author

sssoleileraaa commented Aug 12, 2020

This could be related to #1134 because that's when I started seeing the flakes. Keeping track of how often this starts happening here.

Update

We've confirmed that the problem persists even when we revert #1134

@sssoleileraaa
Copy link
Contributor Author

sssoleileraaa commented Aug 12, 2020

Also the flakes only seem to occur in circleci. You can see how rerunning the workflow on the main branch resolved it here for #347: https://app.circleci.com/pipelines/github/freedomofpress/securedrop-client?branch=main.

Update

We're able to reproduce locally now (see STR in this Issue's description)

@emkll emkll added this to SecureDrop Sprint #56 - 8/6-8/20 in SecureDrop Team Board Aug 13, 2020
@eloquence
Copy link
Member

We'll continue to keep an eye on this. @rmol has encountered these flakes as well, and has committed to at least a brief investigation during the 8/20-9/2 sprint.

@eloquence
Copy link
Member

No progress on this last sprint but test integrity is important, so @rmol will take a look 9/3-9/17, same timebox.

@eloquence
Copy link
Member

We'll continue to watch for flaky tests on client PRs during the 9/17-10/1 sprint and @rmol will help investigate as warranted.

@emkll
Copy link
Contributor

emkll commented Oct 1, 2020

@eloquence eloquence removed this from SecureDrop Sprint #61 - 10/15-10/28 in SecureDrop Team Board Oct 15, 2020
@sssoleileraaa
Copy link
Contributor Author

@eloquence
Copy link
Member

Given these continued test flakes, we're keeping this on the 10/28-11/12 sprint for continued investigation. Would be good to better understand cause of the flake before the November 4 release, to reduce risk of unintended regression beyond the test flakiness.

@sssoleileraaa
Copy link
Contributor Author

The nice thing about these flakes is that the second time I run ci, the tests seem to always pass.

@eloquence
Copy link
Member

@creviera has committed to poking at John's findings in #1158 during the 11/12-11/25 sprint to see if we can get closer to a root cause here.

@eloquence eloquence added this to Maintenance period (Kanban mode) in SecureDrop Team Board Mar 16, 2021
@emkll
Copy link
Contributor

emkll commented Mar 26, 2021

I have observed three failures in a row on main this morning, all different tests:

@sssoleileraaa
Copy link
Contributor Author

I had a little bit of time to look at this again and I think we need to look carefully at self.button_animation.frameChanged.connect(self.animate_activestate) and self.header_animation.frameChanged.connect(self.animate_header) to understand how the handlers might be called after the export/print dialogs are closed.

@sssoleileraaa
Copy link
Contributor Author

As a reminder, we show an animation next to the file name in the header when we are booting up sd-devices and we show an animation in the continue_button when it's making a request to sd-devices

@eloquence eloquence moved this from 1.8.1 Release Period (Kanban Mode) to In Development in SecureDrop Team Board Apr 12, 2021
@sssoleileraaa
Copy link
Contributor Author

I checked if removing the animations fixed the problem, and even though it made the test failures a lot less likely to occur (because most of the failures are when we access modal widgets to display our animations in) it did not fix the issue. We can still see flaky tests when they initialize a new export or print dialog (see https://app.circleci.com/pipelines/github/freedomofpress/securedrop-client/646/workflows/8237e0d1-9660-4d39-ad96-b2ba38aa2d6b/jobs/5294), which makes it seem more likely to be an issue with test concurrency + ordering since we run our tests in parallel and in random order. Testing methods such as _print_file will close a dialog that a test running in parallel might have just unknowningly sent a signal to. I think the next step is to confirm my hypothesis. Hopefully this is it and we just need to add a dialog fixture or something that provides dialog initialization and cleanup between tests.

@sssoleileraaa
Copy link
Contributor Author

Oh, also, as a side note, https://circleci.com/api/v1.1/project/github/freedomofpress/securedrop-client/5298/output/103/0?file=true&allocation-id=6075fe02363fc6688fd0c788-0-build%2F51B3148A happened twice in the past 30 mins, so marking all our functional tests flaky might not be the best solution in the long run, but one flaky issue at a time...

@eloquence
Copy link
Member

For the 4/15 sprint, @creviera has committed to resolving one type of test flake (possibly the animation-related ones). We'll keep chipping away at this.

@sssoleileraaa
Copy link
Contributor Author

sssoleileraaa commented Apr 22, 2021

For testing, we can use this script to repro and to see that it is fixed.

Update

Meant to say when it is fixed

@conorsch
Copy link
Contributor

Immediately post-merge of #1248, I saw the following failure in CircleCI:

failure output
=================================== FAILURES ===================================
______ test_ExportDialog__update_dialog_when_status_is_USB_NOT_CONNECTED _______

mocker = <pytest_mock.MockFixture object at 0x7fd114d40290>

    def test_ExportDialog__update_dialog_when_status_is_USB_NOT_CONNECTED(mocker):
>       dialog = ExportDialog(mocker.MagicMock(), "mock_uuid", "mock_filename")

tests/gui/test_widgets.py:3827: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <securedrop_client.gui.widgets.ExportDialog object at 0x7fd114eb6690>
controller = <MagicMock id='140535974330704'>, file_uuid = 'mock_uuid'
file_name = 'mock_filename'

    def __init__(self, controller: Controller, file_uuid: str, file_name: str):
        super().__init__()
    
        self.controller = controller
        self.file_uuid = file_uuid
        self.file_name = SecureQLabel(
            file_name, wordwrap=False, max_length=self.FILENAME_WIDTH_PX
        ).text()
        self.error_status = ""  # Hold onto the error status we receive from the Export VM
    
        # Connect controller signals to slots
        self.controller.export.preflight_check_call_success.connect(self._on_preflight_success)
        self.controller.export.preflight_check_call_failure.connect(self._on_preflight_failure)
        self.controller.export.export_usb_call_success.connect(self._on_export_success)
        self.controller.export.export_usb_call_failure.connect(self._on_export_failure)
    
        # Connect parent signals to slots
>       self.continue_button.setEnabled(False)
E       RuntimeError: wrapped C/C++ object of type QPushButton has been deleted

securedrop_client/gui/widgets.py:2730: RuntimeError

@sssoleileraaa
Copy link
Contributor Author

yup! #1248 fixed yet another flaky test. and just a heads up that we also have flaky functional tests. but what you linked to here is the runtime error mentioned in this issue which I'm trying to fix or minimize in #1243

SecureDrop Team Board automation moved this from In Development to Done Apr 27, 2021
@sssoleileraaa
Copy link
Contributor Author

sssoleileraaa commented Nov 4, 2021

@sssoleileraaa sssoleileraaa reopened this Nov 4, 2021
@sssoleileraaa
Copy link
Contributor Author

bumping... we have more eyes on the project so others see this issue now too

@sssoleileraaa
Copy link
Contributor Author

sssoleileraaa commented Dec 22, 2021

it's still pretty rare - nothing like it was before - but it's such a headache that i plan to take a peek 👀 again after the holiday

@gonzalo-bulnes
Copy link
Contributor

@conorsch, @rocodes and I encountered this one today a few times.

Hypothesis:

  • Fact (I think!): Python references to Qt objects should prevent garbage collection of the underlying C++ object.
  • Idea: what if a Python reference was removed while an event is waiting to be processed by the Qt event loop?

@rocodes was noting that specially in a testing context we create and tear down objects very quickly.

Note: even if this hypothesis was true, I don't think that should be happening!! 😬

@gonzalo-bulnes
Copy link
Contributor

I had a little bit of time to look at this again and I think we need to look carefully at self.button_animation.frameChanged.connect(self.animate_activestate) and self.header_animation.frameChanged.connect(self.animate_header) to understand how the handlers might be called after the export/print dialogs are closed.

With the above hypothesis in mind, it would make sense to me that the issue would occur particularly around signals.

@sssoleileraaa
Copy link
Contributor Author

A new test failure occurred in https://app.circleci.com/pipelines/github/freedomofpress/securedrop-client/1955/workflows/b8a2a85f-6edb-44d9-b07b-bd65d35ddcf5/jobs/8077 and looks related to

if self.seen_messages.count() or self.is_read:
, but will have to investigate later:

tests/functional/test_seen.py::test_seen_and_unseen Fatal Python error: Aborted

The test usually passes so this might point us to a race.

@sssoleileraaa sssoleileraaa moved this from Done to Near Term - SD Workstation in SecureDrop Team Board Apr 20, 2022
@sssoleileraaa sssoleileraaa removed this from Near Term - SD Workstation in SecureDrop Team Board Apr 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants