Removes stable tests from quarantine #10768

potiuk · 2020-09-06T15:40:46Z

We've observed the tests for the last couple of weeks and it seems
most of the tests marked with the "quarantine" marker are succeeding
in a stable way (#10118).
The removed tests have a success ratio of > 95% (20 runs without
problems) and this was also verified a week ago,
so it seems they are rather stable.
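
As a rough illustration of the criterion above (a sketch only, not the code used by the CI; `is_stable` and its parameters are hypothetical names), a test counts as stable when its last 20 runs exceed a 95% success ratio. With 20 runs that effectively means all 20 passed:

```python
def is_stable(results, min_runs=20, min_percent=95):
    """Return True when the most recent runs exceed the success threshold.

    results: list of booleans, oldest first (True means the run passed).
    Both the function name and the defaults are illustrative assumptions.
    """
    recent = results[-min_runs:]
    if len(recent) < min_runs:
        # Not enough history to judge the test yet.
        return False
    passed = sum(recent)
    # Integer comparison avoids floating-point edge cases at exactly 95%.
    return passed * 100 > min_percent * len(recent)
```

Note that 19 passes out of 20 is exactly 95%, so it does not satisfy "> 95%".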

There are only a few that are either failing or causing
the Quarantined builds to hang. I manually reviewed the
master tests that failed over the last few weeks and added the
tests that are causing the build to hang.

It seems that stability has improved. This might be caused
by some temporary problems at the time we marked the tests as
quarantined, or by too "generous" a way of marking tests as
quarantined. Or maybe the improvement comes from #10368, as the
docker engine and machines used to run the builds in GitHub
experience far less load (image builds are executed in separate
builds), so resource usage may have decreased. Another reason
might be GitHub Actions stability improvements.

Or simply those tests are more stable when run in isolation.

We might still add failing tests back as soon as we see them behave
in a flaky way.

The remaining quarantined tests that need to be fixed:

test_local_run (often hangs the build)
test_retry_handling_job
test_clear_multiple_external_task_marker
test_should_force_kill_process

We also move some of those tests to the "heisentests" category.
Those tests run fine in isolation but fail
the builds when run with all other tests:

test_change_state_for_tis_without_dagrun
test_cli_webserver_background
TestImpersonation tests

We might find that those heisentests can be fixed, but for
now we are going to run them in isolation.

Also, since those quarantined tests are failing more often,
the "num runs" to track for them has been decreased to 10,
to keep track of the last 10 runs only.
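
To illustrate the idea of tracking only the last 10 runs (a minimal sketch; the `RunHistory` class is a hypothetical name and not part of the actual tracking tooling), a bounded deque drops the oldest result automatically once the window is full:

```python
from collections import deque


class RunHistory:
    """Keeps only the most recent `num_runs` results for one test (illustrative)."""

    def __init__(self, num_runs=10):
        # deque with maxlen discards the oldest entry when a new one is appended.
        self.runs = deque(maxlen=num_runs)

    def record(self, passed):
        """Store the outcome of one run (True means the run passed)."""
        self.runs.append(passed)

    def failures(self):
        """Count failed runs within the tracked window."""
        return sum(1 for passed in self.runs if not passed)
```

A shorter window makes the tracked status react faster for tests that fail often, at the cost of forgetting older failures sooner.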


XD-DENG · 2020-09-06T15:58:38Z

Thanks @potiuk !

Given it's test-related, I would tend to be more "picky": one check is failing https://github.com/apache/airflow/pull/10768/checks?check_run_id=1078200405 . It looks to me like a random failure, but I would suggest re-running it to make sure it goes green before we merge.

potiuk · 2020-09-06T18:08:02Z

> Given it's test-related, I would tend to be more "picky": one check is failing

Absolutely @XD-DENG . I am also using this PR to test the behaviour of some tests in/out of isolation.

It seems that we are hitting the (in)famous Heisenbugs: https://en.wikipedia.org/wiki/Heisenbug

If we observe them in isolation, they run just fine, but when you run them together with other tests, they start to fail randomly (most likely due to memory problems). I will try to isolate those that are Heisenbugs and, depending on how many of them we have, I will try to find some way to handle them (or maybe indeed change them permanently to run in isolation) :D.

potiuk · 2020-09-07T06:34:20Z

I've added a new category, "Heisentests", and I am moving there those tests that do not like to run together with others. We might want to fix them at some point (decrease resource consumption?) but for now I think the best course of action is simply to run them in isolation.

We can do it thanks to #10368, as we need to increase the number of jobs (and the per-job overhead is much smaller).

dimberman · 2020-09-07T15:46:52Z

@potiuk +1 for heisentest 😆


potiuk · 2020-09-07T17:02:04Z

> @potiuk +1 for heisentest

Yeah. I also like the term. It's a slight variation of https://en.wikipedia.org/wiki/Heisenbug, which is worth reading on its own.


boring-cyborg bot added area:CLI area:dev-tools area:Scheduler including HA (high availability) scheduler area:webserver Webserver related Issues labels Sep 6, 2020

potiuk requested review from ashb, kaxil, mik-laj, XD-DENG and turbaszek September 6, 2020 15:40

potiuk force-pushed the remove-stable-tests-from-quarantine branch 3 times, most recently from 5debe7f to 85108db Compare September 6, 2020 17:50

potiuk force-pushed the remove-stable-tests-from-quarantine branch 4 times, most recently from 93cc591 to 1d398d9 Compare September 7, 2020 06:30

potiuk force-pushed the remove-stable-tests-from-quarantine branch 2 times, most recently from 65fa9cc to 0e3a6c9 Compare September 7, 2020 07:29

dimberman approved these changes Sep 7, 2020


kaxil reviewed Sep 7, 2020


potiuk force-pushed the remove-stable-tests-from-quarantine branch from dba95ab to 862e07b Compare September 7, 2020 18:49

potiuk merged commit b746f33 into apache:master Sep 8, 2020

potiuk deleted the remove-stable-tests-from-quarantine branch September 8, 2020 05:36

potiuk added this to the Airflow 1.10.13 milestone Nov 16, 2020

potiuk added the type:misc/internal Changelog: Misc changes that should appear in change log label Nov 16, 2020
