Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removes stable tests from quarantine #10768

Merged

Conversation

potiuk
Copy link
Member

@potiuk potiuk commented Sep 6, 2020

We've observed the tests for last couple of weeks and it seems
most of the tests marked with "quarantine" marker are succeeding
in a stable way (#10118)
The removed tests have success ratio of > 95% (20 runs without
problems) and this has been verified a week ago as well,
so it seems they are rather stable.

There are literally few that are either failing or causing
the Quarantined builds to hang. I manually reviewed the
master tests that failed for last few weeks and added the
tests that are causing the build to hang.

Seems that stability has improved - which might be casued
by some temporary problems when we marked the quarantined builds
or too "generous" way of marking test as quarantined, or
maybe improvement comes from the #10368 as the docker engine
and machines used to run the builds in GitHub experience far
less load (image builds are executed in separate builds) so
it might be that resource usage is decreased. Another reason
might be Github Actions stability improvements.

Or simply those tests are more stable when run isolation.

We might still add failing tests back as soon we see them behave
in a flaky way.

The remaining quarantined tests that need to be fixed:

  • test_local_run (often hangs the build)
  • test_retry_handling_job
  • test_clear_multiple_external_task_marker
  • test_should_force_kill_process

We also move some of those tests to "heisentests" category
Those testst run fine in isolation but fail
the builds when run with all other tests:

  • test_change_state_for_tis_without_dagrun
  • test_cli_webserver_background
  • TestImpersonation tests

We might find that those heisentest can be fixed but for
now we are going to run them in isolation.

Also - since those quarantined tests are failing more often
the "num runs" to track for those has been decreased to 10
to keep track of 10 last runs only.


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@boring-cyborg boring-cyborg bot added area:CLI area:dev-tools area:Scheduler including HA (high availability) scheduler area:webserver Webserver related Issues labels Sep 6, 2020
@XD-DENG
Copy link
Member

XD-DENG commented Sep 6, 2020

Thanks @potiuk !

Given it's test-related, I would tend to be more "picky": one check is failing https://github.com/apache/airflow/pull/10768/checks?check_run_id=1078200405 . Looks to me a random failure, but I would suggest to re-run and ensure we get it green, before we merge

@potiuk potiuk force-pushed the remove-stable-tests-from-quarantine branch 3 times, most recently from 5debe7f to 85108db Compare September 6, 2020 17:50
@potiuk
Copy link
Member Author

potiuk commented Sep 6, 2020

Given it's test-related, I would tend to be more "picky": one check is failing

Absolutely @XD-DENG . I am also using this PR to test the behaviour of some tests in/out of isolation.

It seems that we are hitting the (in)famous Heisenbugs: https://en.wikipedia.org/wiki/Heisenbug

If we observe them in isolation, they run just fine. but when you run them together with other tests, they start to fail randomly (most likely on memory problems). I will try to isolate those that are the Heiisnebugs and depending how many of those we have, I will try to find some way to handle those (or maybe indeed change them permanently to run in isolation) :D.

@potiuk potiuk force-pushed the remove-stable-tests-from-quarantine branch 4 times, most recently from 93cc591 to 1d398d9 Compare September 7, 2020 06:30
@potiuk
Copy link
Member Author

potiuk commented Sep 7, 2020

I've added new category "Heisentests" and I am moving those tests that do not like to run together with others there. We might want to fix them at some point of time (decrease resource consumption?) but for now, the best course of action is I think to simply run them in isolation.

We can do it thanks to the #10368 as we need to increase number of jobs (and the overhead per-job is much smaller)

@potiuk potiuk force-pushed the remove-stable-tests-from-quarantine branch 2 times, most recently from 65fa9cc to 0e3a6c9 Compare September 7, 2020 07:29
@dimberman
Copy link
Contributor

@potiuk +1 for heisentest 😆

TESTING.rst Outdated Show resolved Hide resolved
@potiuk
Copy link
Member Author

potiuk commented Sep 7, 2020

@potiuk +1 for heisentest

Yeah. I also like the term. It's a slight variation of the https://en.wikipedia.org/wiki/Heisenbug - which is worth reading on its own.

We've observed the tests for last couple of weeks and it seems
most of the tests marked with "quarantine" marker are succeeding
in a stable way (apache#10118)
The removed tests have success ratio of > 95% (20 runs without
problems) and this has been verified a week ago as well,
so it seems they are rather stable.

There are literally few that are either failing or causing
the Quarantined builds to hang. I manually reviewed the
master tests that failed for last few weeks and added the
tests that are causing the build to hang.

Seems that stability has improved - which might be casued
by some temporary problems when we marked the quarantined builds
or too "generous" way of marking test as quarantined, or
maybe improvement comes from the apache#10368 as the docker engine
and machines used to run the builds in GitHub experience far
less load (image builds are executed in separate builds) so
it might be that resource usage is decreased. Another reason
might be Github Actions stability improvements.

Or simply those tests are more stable when run isolation.

We might still add failing tests back as soon we see them behave
in a flaky way.

The remaining quarantined tests that need to be fixed:
 * test_local_run (often hangs the build)
 * test_retry_handling_job
 * test_clear_multiple_external_task_marker
 * test_should_force_kill_process
 * test_change_state_for_tis_without_dagrun
 * test_cli_webserver_background

We also move some of those tests to "heisentests" category
Those testst run fine in isolation but fail
the builds when run with all other tests:
 * TestImpersonation tests

We might find that those heisentest can be fixed but for
now we are going to run them in isolation.

Also - since those quarantined tests are failing more often
the "num runs" to track for those has been decreased to 10
to keep track of 10 last runs only.
@potiuk potiuk force-pushed the remove-stable-tests-from-quarantine branch from dba95ab to 862e07b Compare September 7, 2020 18:49
@potiuk potiuk merged commit b746f33 into apache:master Sep 8, 2020
@potiuk potiuk deleted the remove-stable-tests-from-quarantine branch September 8, 2020 05:36
potiuk added a commit that referenced this pull request Nov 16, 2020
We've observed the tests for last couple of weeks and it seems
most of the tests marked with "quarantine" marker are succeeding
in a stable way (#10118)
The removed tests have success ratio of > 95% (20 runs without
problems) and this has been verified a week ago as well,
so it seems they are rather stable.

There are literally few that are either failing or causing
the Quarantined builds to hang. I manually reviewed the
master tests that failed for last few weeks and added the
tests that are causing the build to hang.

Seems that stability has improved - which might be casued
by some temporary problems when we marked the quarantined builds
or too "generous" way of marking test as quarantined, or
maybe improvement comes from the #10368 as the docker engine
and machines used to run the builds in GitHub experience far
less load (image builds are executed in separate builds) so
it might be that resource usage is decreased. Another reason
might be Github Actions stability improvements.

Or simply those tests are more stable when run isolation.

We might still add failing tests back as soon we see them behave
in a flaky way.

The remaining quarantined tests that need to be fixed:
 * test_local_run (often hangs the build)
 * test_retry_handling_job
 * test_clear_multiple_external_task_marker
 * test_should_force_kill_process
 * test_change_state_for_tis_without_dagrun
 * test_cli_webserver_background

We also move some of those tests to "heisentests" category
Those testst run fine in isolation but fail
the builds when run with all other tests:
 * TestImpersonation tests

We might find that those heisentest can be fixed but for
now we are going to run them in isolation.

Also - since those quarantined tests are failing more often
the "num runs" to track for those has been decreased to 10
to keep track of 10 last runs only.

(cherry picked from commit b746f33)
@potiuk potiuk added this to the Airflow 1.10.13 milestone Nov 16, 2020
@potiuk potiuk added the type:misc/internal Changelog: Misc changes that should appear in change log label Nov 16, 2020
potiuk added a commit that referenced this pull request Nov 16, 2020
We've observed the tests for last couple of weeks and it seems
most of the tests marked with "quarantine" marker are succeeding
in a stable way (#10118)
The removed tests have success ratio of > 95% (20 runs without
problems) and this has been verified a week ago as well,
so it seems they are rather stable.

There are literally few that are either failing or causing
the Quarantined builds to hang. I manually reviewed the
master tests that failed for last few weeks and added the
tests that are causing the build to hang.

Seems that stability has improved - which might be casued
by some temporary problems when we marked the quarantined builds
or too "generous" way of marking test as quarantined, or
maybe improvement comes from the #10368 as the docker engine
and machines used to run the builds in GitHub experience far
less load (image builds are executed in separate builds) so
it might be that resource usage is decreased. Another reason
might be Github Actions stability improvements.

Or simply those tests are more stable when run isolation.

We might still add failing tests back as soon we see them behave
in a flaky way.

The remaining quarantined tests that need to be fixed:
 * test_local_run (often hangs the build)
 * test_retry_handling_job
 * test_clear_multiple_external_task_marker
 * test_should_force_kill_process
 * test_change_state_for_tis_without_dagrun
 * test_cli_webserver_background

We also move some of those tests to "heisentests" category
Those testst run fine in isolation but fail
the builds when run with all other tests:
 * TestImpersonation tests

We might find that those heisentest can be fixed but for
now we are going to run them in isolation.

Also - since those quarantined tests are failing more often
the "num runs" to track for those has been decreased to 10
to keep track of 10 last runs only.

(cherry picked from commit b746f33)
kaxil pushed a commit that referenced this pull request Nov 18, 2020
We've observed the tests for last couple of weeks and it seems
most of the tests marked with "quarantine" marker are succeeding
in a stable way (#10118)
The removed tests have success ratio of > 95% (20 runs without
problems) and this has been verified a week ago as well,
so it seems they are rather stable.

There are literally few that are either failing or causing
the Quarantined builds to hang. I manually reviewed the
master tests that failed for last few weeks and added the
tests that are causing the build to hang.

Seems that stability has improved - which might be casued
by some temporary problems when we marked the quarantined builds
or too "generous" way of marking test as quarantined, or
maybe improvement comes from the #10368 as the docker engine
and machines used to run the builds in GitHub experience far
less load (image builds are executed in separate builds) so
it might be that resource usage is decreased. Another reason
might be Github Actions stability improvements.

Or simply those tests are more stable when run isolation.

We might still add failing tests back as soon we see them behave
in a flaky way.

The remaining quarantined tests that need to be fixed:
 * test_local_run (often hangs the build)
 * test_retry_handling_job
 * test_clear_multiple_external_task_marker
 * test_should_force_kill_process
 * test_change_state_for_tis_without_dagrun
 * test_cli_webserver_background

We also move some of those tests to "heisentests" category
Those testst run fine in isolation but fail
the builds when run with all other tests:
 * TestImpersonation tests

We might find that those heisentest can be fixed but for
now we are going to run them in isolation.

Also - since those quarantined tests are failing more often
the "num runs" to track for those has been decreased to 10
to keep track of 10 last runs only.

(cherry picked from commit b746f33)
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
We've observed the tests for last couple of weeks and it seems
most of the tests marked with "quarantine" marker are succeeding
in a stable way (apache#10118)
The removed tests have success ratio of > 95% (20 runs without
problems) and this has been verified a week ago as well,
so it seems they are rather stable.

There are literally few that are either failing or causing
the Quarantined builds to hang. I manually reviewed the
master tests that failed for last few weeks and added the
tests that are causing the build to hang.

Seems that stability has improved - which might be casued
by some temporary problems when we marked the quarantined builds
or too "generous" way of marking test as quarantined, or
maybe improvement comes from the apache#10368 as the docker engine
and machines used to run the builds in GitHub experience far
less load (image builds are executed in separate builds) so
it might be that resource usage is decreased. Another reason
might be Github Actions stability improvements.

Or simply those tests are more stable when run isolation.

We might still add failing tests back as soon we see them behave
in a flaky way.

The remaining quarantined tests that need to be fixed:
 * test_local_run (often hangs the build)
 * test_retry_handling_job
 * test_clear_multiple_external_task_marker
 * test_should_force_kill_process
 * test_change_state_for_tis_without_dagrun
 * test_cli_webserver_background

We also move some of those tests to "heisentests" category
Those testst run fine in isolation but fail
the builds when run with all other tests:
 * TestImpersonation tests

We might find that those heisentest can be fixed but for
now we are going to run them in isolation.

Also - since those quarantined tests are failing more often
the "num runs" to track for those has been decreased to 10
to keep track of 10 last runs only.

(cherry picked from commit b746f33)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:CLI area:dev-tools area:Scheduler including HA (high availability) scheduler area:webserver Webserver related Issues type:misc/internal Changelog: Misc changes that should appear in change log
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants