
Investigate deadlock, hanging of tests in CI, whether it is CI, F# tasks, TaskSeq CE or XUnit, see also #25 #27

Closed · wants to merge 5 commits

Conversation

@abelbraaksma (Member) commented on Oct 14, 2022

(see the bottom for the original issue description and research)

xUnit bug; see the report in xunit/xunit#2587

It turns out that if all of the following are true, xUnit ends up in a deadlock in CI, but not locally:

  • You have two or more Task<unit>-returning tests
  • These tests live in separate types or modules
  • At least two of the tasks yield.

The example in xunit/xunit#2587 is given as two tasks that each do a Task.Delay(1), which, as we all know, will yield and pause for the length of at least one timer event (about 15.6 ms on most systems). Even though there are no shared variables, this leads to a deadlock.
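Concretely, the failing pattern boils down to something like the following. This is a minimal sketch; the namespace, module, and test names are made up for illustration:

```fsharp
namespace Tests

open System.Threading.Tasks
open Xunit

// Two Task<unit>-returning tests in two separate modules, each yielding once.
module ModuleA =
    [<Fact>]
    let ``yielding test in module A`` () : Task<unit> =
        task { do! Task.Delay(1) }   // yields: waits for at least one timer tick

module ModuleB =
    [<Fact>]
    let ``yielding test in module B`` () : Task<unit> =
        task { do! Task.Delay(1) }   // yields as well; on CI this pair can deadlock
```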

A good analysis is in this comment: dotnet/BenchmarkDotNet#2114 (comment).
Possibly related: dotnet/corefx#559 (i.e., the difference between local and CI may simply be that the CI machine has fewer threads in the thread pool, and thread starvation then causes the deadlock).
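If the thread-starvation theory is right, it should be possible to provoke the hang locally by shrinking the thread pool before the tests run. The following is a hypothetical diagnostic, not something that was done in this PR:

```fsharp
open System.Threading

// Hypothetical diagnostic: constrain the thread pool to mimic a small CI
// machine, then run the yielding tests and see whether they hang locally too.
// Note: SetMaxThreads returns false (and changes nothing) if the requested
// value is lower than the machine's processor count.
ThreadPool.SetMinThreads(1, 1) |> ignore
ThreadPool.SetMaxThreads(2, 2) |> ignore
```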

Solution (for now)

There are only two possible solutions that I can think of, given that the issue isn't likely to be solved any time soon (this is not a critique, it is just the nature of this kind of problem):


Original text

Trial-and-error approach:

  • In #23: we already tried removing specific tests, but this didn't appear to help
  • In #23: we enabled a global timeout through dotnet test with --blame-hang-timeout 15000ms on the command line in CI
  • In #23: observed that with [<assembly: Xunit.CollectionBehavior(DisableTestParallelization = true)>] the tests run fine (see the sketch after this list)
  • Observed that (so far) the local repro does not exhibit this issue
    Follow-up: running in "Crunch mode" (using NCrunch) for several hours, using as many parallel threads as possible, gives no issues.
  • Re-enabled parallel testing (this issue) to be able to see if the behavior persists
    Follow-up: the green test runs were either non-parallel or self-parallelized, still unclear. Edit: see above.
  • Ran several times in CI to ensure it is stable
    Follow-up: it is not; it is an xUnit bug that shows up on their CI runner.
  • Ran the tests on CI using a reflection-based "grab all tests and run them in parallel" approach and crunched multiple times (a rough sketch of that runner is included below, after the commit list)
    Follow-up: CI, Windows, or the dotnet version on CI has nothing to do with this issue; these tests run fine.
  • Attempted different dotnet test configurations
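For reference, the attribute from the third bullet goes into any compiled file of the test project. A minimal sketch (the module name is arbitrary):

```fsharp
module AssemblyInfo

// Disables xUnit's test parallelization for the whole test assembly;
// with this in place, the CI runs completed reliably.
[<assembly: Xunit.CollectionBehavior(DisableTestParallelization = true)>]
do ()
```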

@abelbraaksma changed the title from "Investigate potential race condition or other cause of hanging CI, see #25" to "Investigate deadlock, hanging of tests in CI, whether it is CI, F# tasks, TaskSeq CE or XUnit, see also #25" on Oct 15, 2022
Commits in this pull request:

  • Re-enable parallelization
  • Blame-timeout to 60s …ind out whether a test has started (but not finished)
  • Log "starting a test", because xUnit does not do it
  • Skip all tests and run by hand, i.e. reflection-based runner test
  • Implement our own test runner
  • Run parallelized self-written testrunner several times in parallel
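The reflection-based runner mentioned above ("grab all tests and run them in parallel") roughly amounts to the following. This is an illustrative sketch only, not the actual runner used in the repo:

```fsharp
open System
open System.Reflection
open System.Threading.Tasks
open Xunit

/// Finds every parameterless [<Fact>]-annotated method in the given assembly
/// and runs them all concurrently, bypassing xUnit's own scheduler.
let runAllFactsInParallel (asm: Assembly) : Task<unit[]> =
    asm.GetTypes()
    |> Seq.collect (fun t ->
        t.GetMethods(BindingFlags.Public ||| BindingFlags.Static ||| BindingFlags.Instance))
    |> Seq.filter (fun m ->
        m.GetParameters().Length = 0
        && m.GetCustomAttributes(typeof<FactAttribute>, true).Length > 0)
    |> Seq.map (fun m ->
        task {
            // F# module tests compile to static methods; class-based tests need an instance
            let instance =
                if m.IsStatic then null else Activator.CreateInstance m.DeclaringType
            match m.Invoke(instance, [||]) with
            | :? Task as t -> do! t   // await Task / Task<unit>-returning tests
            | _ -> ()                 // synchronous (unit-returning) tests
        })
    |> Seq.toArray
    |> Task.WhenAll
```

Running something like this repeatedly (the "crunch multiple times" bullet) never reproduced the hang, which is what pointed at xUnit's own scheduling rather than CI, Windows, or the tests themselves.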
@abelbraaksma (Member, Author) commented:

Closing this: we run "in parallel" locally just fine, both in NCrunch, from the command line, and in VS 2022, so this only affects CI, where we have disabled "in parallel".

@abelbraaksma deleted the investigate-race-condition branch on October 22, 2022
@abelbraaksma added the "investigation" label (Requires further or deeper investigation) on Oct 22, 2022
@abelbraaksma added this to the v0.1.0 milestone on Mar 17, 2024