This repository has been archived by the owner on Dec 19, 2018. It is now read-only.

dotnet test keeps hanging #1537

Closed
muratg opened this issue Aug 28, 2018 · 18 comments


muratg commented Aug 28, 2018

From @livarcocc on Tuesday, 21 August 2018 23:19:09

From @dustinmoris on August 21, 2018 20:20

Steps to reproduce

  1. Go to https://github.com/giraffe-fsharp/Giraffe
  2. Clone the repo and switch to the develop branch
  3. Run .\build.ps1 (or just run dotnet test ./tests/Giraffe.Tests/Giraffe.Tests.fsproj -f netcoreapp2.1)

The build will succeed.

If you go to the Giraffe.Tests.fsproj file and change

<PackageReference Include="Microsoft.AspNetCore.TestHost" Version="2.0.*" />

to

<PackageReference Include="Microsoft.AspNetCore.TestHost" Version="2.1.*" />

and re-run the tests, you'll see that the run hangs forever, as if something has deadlocked.

Previously I was using dotnet xunit, which had no such problems, but since dotnet xunit has been discontinued I switched over to dotnet test, which is now giving me problems.

Also this issue doesn't exist if I downgrade the test project to netcoreapp2.0.

Expected behavior

Tests will not deadlock with Microsoft.AspNetCore.TestHost version 2.1.*.

Actual behavior

Tests hang forever.

Environment data

dotnet --info output:

Dustins-MBP:giraffe dustinmoris$ dotnet --info
.NET Core SDK (reflecting any global.json):
 Version:   2.1.401
 Commit:    91b1c13032

Runtime Environment:
 OS Name:     Mac OS X
 OS Version:  10.13
 OS Platform: Darwin
 RID:         osx.10.13-x64
 Base Path:   /usr/local/share/dotnet/sdk/2.1.401/

Host (useful for support):
  Version: 2.1.3
  Commit:  124038c13e

.NET Core SDKs installed:
  2.0.0 [/usr/local/share/dotnet/sdk]
  2.0.3 [/usr/local/share/dotnet/sdk]
  2.1.3 [/usr/local/share/dotnet/sdk]
  2.1.4 [/usr/local/share/dotnet/sdk]
  2.1.300 [/usr/local/share/dotnet/sdk]
  2.1.301 [/usr/local/share/dotnet/sdk]
  2.1.302 [/usr/local/share/dotnet/sdk]
  2.1.400 [/usr/local/share/dotnet/sdk]
  2.1.401 [/usr/local/share/dotnet/sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.0 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.All 2.1.1 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.All 2.1.2 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.All 2.1.3 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.0 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 2.1.1 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 2.1.2 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 2.1.3 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.0.0 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.0.3 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.0.4 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.0.5 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.1.0 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.1.1 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.1.2 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.1.3 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]

To install additional .NET Core runtimes or SDKs:
  https://aka.ms/dotnet-download

Copied from original issue: dotnet/cli#9874

Copied from original issue: dotnet/aspnetcore#3443


muratg commented Aug 28, 2018

From @livarcocc on Tuesday, 21 August 2018 23:19:10

@natemcmaster do you know who owns Microsoft.AspNetCore.TestHost?


muratg commented Aug 28, 2018

From @livarcocc on Tuesday, 21 August 2018 23:19:10

From @natemcmaster on August 21, 2018 21:11

That's my team. You can move this to aspnet/Home.


muratg commented Aug 28, 2018

From @natemcmaster on Wednesday, 22 August 2018 00:21:54

@dustinmoris can you share more logs? What do you see in the console before the tests hang? Also, VSTest diagnostic logs may provide information about which process is stuck. Unlike dotnet xunit, dotnet test uses multiple processes to launch and run tests. You can enable the diagnostic logs with `dotnet test --diag mylog.txt`. If you take a look inside those logs, you'll be able to see which processes were created and which are hanging. This will help if we need you to gather a process dump.


muratg commented Aug 28, 2018

From @dustinmoris on Friday, 24 August 2018 22:31:36

@natemcmaster:

Before the tests I just see the normal .NET Core output from building the project:

Dustins-MBP:giraffe dustinmoris$ dotnet build ./tests/Giraffe.Tests/Giraffe.Tests.fsproj
Microsoft (R) Build Engine version 15.8.166+gd4e8d81a88 for .NET Core
Copyright (C) Microsoft Corporation. All rights reserved.

  Restoring packages for /Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/Giraffe.Tests.fsproj...
  Restore completed in 32.02 ms for /Users/dustinmoris/GitHub/Giraffe/src/Giraffe/Giraffe.fsproj.
  Restore completed in 989.08 ms for /Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/Giraffe.Tests.fsproj.
  Giraffe -> /Users/dustinmoris/GitHub/Giraffe/src/Giraffe/bin/Debug/netstandard2.0/Giraffe.dll
  Giraffe -> /Users/dustinmoris/GitHub/Giraffe/src/Giraffe/bin/Debug/net461/Giraffe.dll
  Giraffe.Tests -> /Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/bin/Debug/net461/Giraffe.Tests.dll
  Giraffe.Tests -> /Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/bin/Debug/netcoreapp2.1/Giraffe.Tests.dll

Build succeeded.
    0 Warning(s)
    0 Error(s)

Time Elapsed 00:00:18.40
Dustins-MBP:giraffe dustinmoris$ dotnet test ./tests/Giraffe.Tests/Giraffe.Tests.fsproj -f netcoreapp2.1 --diag testlog.txt
Microsoft (R) Build Engine version 15.8.166+gd4e8d81a88 for .NET Core
Copyright (C) Microsoft Corporation. All rights reserved.

  Restore completed in 41.46 ms for /Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/Giraffe.Tests.fsproj.
  Restore completed in 41.44 ms for /Users/dustinmoris/GitHub/Giraffe/src/Giraffe/Giraffe.fsproj.
Build started, please wait...
Build completed.

Test run for /Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/bin/Debug/netcoreapp2.1/Giraffe.Tests.dll(.NETCoreApp,Version=v2.1)
Microsoft (R) Test Execution Command Line Tool Version 15.8.0
Copyright (c) Microsoft Corporation.  All rights reserved.

Starting test execution, please wait...
Logging Vstest Diagnostics in file: /Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/testlog.txt
^C
Attempting to cancel the build...
Dustins-MBP:giraffe dustinmoris$ 

Because it hangs I had to cancel the build.

After running dotnet test with the --diag flag it actually created two log files:

testlog.host.18-08-24_23-22-45_88253_4.txt
testlog.txt

Seems like this is the point at which it starts hanging. After this line it just keeps saying Polling on remoteEndPoint:

"TestRunStatistics":{"ExecutedTests":59,"Stats":{"Passed":59}},"ActiveTests":[{"Id":"1c4254f9-7162-8077-35e3-3e0a4cbbd847","FullyQualifiedName":"Giraffe.Tests.HttpContextExtensionsTests.WriteHtmlFileAsync should return html from content folder","DisplayName":"Giraffe.Tests.HttpContextExtensionsTests.WriteHtmlFileAsync should return html from content folder","ExecutorUri":"executor://xunit/VsTestRunner2/netcoreapp","Source":"/Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/bin/Debug/netcoreapp2.1/Giraffe.Tests.dll","CodeFilePath":null,"LineNumber":0,"Properties":[]}]}}
TpTrace Verbose: 0 : 61463, 5, 2018/08/24, 23:22:48.749, 158168527666476, vstest.console.dll, TestRunRequest:SendTestRunStatsChange: Starting.
TpTrace Verbose: 0 : 61463, 5, 2018/08/24, 23:22:48.749, 158168527877462, vstest.console.dll, InProgress is Giraffe.Tests.HttpContextExtensionsTests.WriteHtmlFileAsync should return html from content folder
TpTrace Information: 0 : 61463, 5, 2018/08/24, 23:22:48.750, 158168528048414, vstest.console.dll, TestRunRequest:SendTestRunStatsChange: Completed.
TpTrace Verbose: 0 : 61463, 5, 2018/08/24, 23:22:48.750, 158168528196945, vstest.console.dll, TcpClientExtensions.MessageLoopAsync: Polling on remoteEndPoint: 127.0.0.1:55624 localEndPoint: 127.0.0.1:55623
TpTrace Verbose: 0 : 61463, 5, 2018/08/24, 23:22:49.755, 158169533575644, vstest.console.dll, TcpClientExtensions.MessageLoopAsync: Polling on remoteEndPoint: 127.0.0.1:55624 localEndPoint: 127.0.0.1:55623

Let me know if there's more information I can provide!


muratg commented Aug 28, 2018

From @natemcmaster on Monday, 27 August 2018 15:39:47

Thanks for the test logs. As far as I can tell, this rules out issues in the test runner itself. There appears to be some kind of deadlock in the test code. Gathering a process dump is the next step for investigating this. When dotnet test runs, there will be several "dotnet.exe" processes. I recommend using Process Explorer to find the right dotnet.exe process for which to collect a full process dump. The dotnet.exe process which is hanging will have startup arguments like this:

dotnet.exe exec --runtimeconfig "/Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/bin/Debug/netcoreapp2.1/Giraffe.Tests.runtimeconfig.json" --depsfile "/Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/bin/Debug/netcoreapp2.1/Giraffe.Tests.deps.json" "/Users/dustinmoris/.nuget/packages/microsoft.testplatform.testhost/15.8.0/lib/netstandard1.5/testhost.dll" --port 55623 --endpoint 127.0.0.1:055623 --role client --parentprocessid 61463 --diag "/Users/dustinmoris/GitHub/Giraffe/tests/Giraffe.Tests/testlog.host.18-08-24_23-22-45_88253_4.txt" --telemetryoptedin false

Once you have a process dump, we can look at where the threads are stuck. You should be able to open .dmp files in Visual Studio.

@dustinmoris

Hi,

So I have figured out that the deadlock occurs with Microsoft.AspNetCore.TestHost 2.1.* when I test routes that access files from the ContentRoot; these tests don't deadlock with TestHost 2.0.*.

Initially I had this problem on my MBP and on my Win 10 machine, but after changing some tests I have fixed it there. However, I am still getting a deadlock on AppVeyor, Travis and Azure Pipelines for both Ubuntu and Windows images, which is really weird, because on two independent Windows 10 machines and on my MacBook Pro I can't reproduce this anymore.

I feel like I am wasting so much time here trying to figure out why tests deadlock after upgrading the NuGet package Microsoft.AspNetCore.TestHost from 2.0 to 2.1. Can someone shed some light on what changed between those versions, or could someone from the team join a Google Hangout where I can screen-share and demo the issue? I even have a GitHub project you could just pull to investigate, but I am literally losing patience now.


dustinmoris commented Oct 20, 2018

You should be able to pull the latest version from the develop branch of Giraffe and then run dotnet test for the Giraffe.Tests.fsproj project.

With TestHost 2.0 the build works perfectly fine.

After updating only the NuGet version of TestHost from 2.0 to 2.1 the dotnet test command deadlocks for the exact same build.

The weird thing is that on my macOS Mojave MBP, my Windows 10 Home laptop and my Windows 10 Professional desktop I am not able to get the tests to hang at all.

I don't think it is an AppVeyor issue though, because the same also happens in TravisCI and Azure Pipelines.

TestHost 2.0 works perfectly on ALL environments.

There is something incredibly weird going on here and I hope you can look into this as there might be some major flaw with TestHost 2.1.

I'm more than happy to assist, but I would hope that Microsoft could take a bit more of the lead on investigating an issue that only emerged after you made changes to your TestHost library.


Tratcher commented Oct 20, 2018

FYI I see two changes in TestHost 2.1 that might have an impact on test execution:
Response Stream Pipes
HttpContext pattern

Edit: I hadn't read the history above... Were you able to narrow down which test is hanging? You can turn on xunit diagnosticMessages to track which test is running. https://xunit.github.io/docs/configuring-with-json
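
For reference, enabling those diagnostic messages should only require an xunit.runner.json next to the test project (and copied to the output directory); this is a minimal sketch following the xunit docs linked above, not something from the Giraffe repo:

{
  "diagnosticMessages": true
}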

@Tratcher

This doesn't repro locally for me on Win10 either. We use AppVeyor and TravisCI ourselves and haven't seen any issues like this. We have known those systems to be flakier if only because they run so much slower. That can make them prone to uncovering race conditions that you wouldn't normally hit during development. It seems like the only way to repro this right now is to send you a PR and try to debug it.

@dustinmoris

Thanks @Tratcher for getting back so quickly and checking this yourself! I have narrowed it down to these tests:

What they both have in common is that the TestHost is being created with a content root and that they both stream a file. The streamFile handler is implemented here, which essentially calls into StreamCopyOperation.CopyToAsync.
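
For context, a rough sketch of what such a handler boils down to (illustrative only; this is not the actual Giraffe streamFile implementation, and the exact CopyToAsync overload shown here is an assumption):

open System
open System.IO
open System.Threading
open Microsoft.AspNetCore.Http
open Microsoft.AspNetCore.Http.Extensions

// Hypothetical helper, not the real Giraffe handler: copy a file from the
// content root into the response body via StreamCopyOperation.CopyToAsync.
// The (source, destination, count, cancellationToken) overload is assumed.
let copyFileToResponse (ctx : HttpContext) (filePath : string) =
    async {
        use source = File.OpenRead filePath
        ctx.Response.ContentLength <- Nullable source.Length
        do! StreamCopyOperation.CopyToAsync(
                source, ctx.Response.Body, Nullable<int64>(), CancellationToken.None)
            |> Async.AwaitTask
    }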

When I remove both files from the compilation by commenting them out in the .fsproj file, the build works fine with TestHost 2.1.

We have known those systems to be flakier if only because they run so much slower. That can make them prone to uncovering race conditions that you wouldn't normally hit during development.

That's a good point; however, I don't think that's the case here, because the tests have been deadlocking consistently on 3 different CI environments (AppVeyor, Travis and Azure Pipelines) every single time for many weeks now (since TestHost 2.1 was released). They also pass consistently on all my machines, and some of them are not particularly powerful either. I was wondering whether what the CI servers have in common is that the tests are executed from a Docker container? I'm not sure how these environments are spawned, but that could be a reason, or maybe it's because they all run on a VM/container with only a single core/CPU allocated.

I've got time this evening to look into this in more detail. If you have some concrete ideas about what to try or how best to debug it, I'll do my best to get that done today.

@Tratcher

The best thing to do is to narrow it down to a single test that fails. It may take a bit of binary searching where you delete half of the tests and see what happens.

@Tratcher

I also see that the tests target both net461 and netcoreapp2.1. I've tried dropping net461 in my PR.

@dustinmoris

OK, I can help you with this, because I've already done a lot of this kind of narrowing where I removed everything but a few tests. I'll add commits to your PR, if that is OK with you?

@Tratcher

Go right ahead.


Tratcher commented Oct 25, 2018

@davidfowl here's another threadpool exhaustion variant: giraffe-fsharp/Giraffe#317 (comment)

xunit limits the number of threads dynamically based on the number of cores. On CIs like AppVeyor and Travis that could be a very low number (2?). It also runs tests in parallel. If tests start blocking those threads, they can quickly starve the threadpool, block any async operations, and cause a full deadlock.
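
To illustrate that failure mode, here is a minimal, hypothetical F# sketch (not taken from the Giraffe test suite) contrasting a test that blocks on .Result with one that awaits properly; with xunit running tests in parallel on a one- or two-core CI agent, the blocking variant can tie up every pool thread.

open System.Threading.Tasks
open Xunit

// Stand-in for an awaitable operation such as an HttpClient call against TestServer.
let simulatedRequest () : Task<int> =
    Task.Delay(100).ContinueWith(fun _ -> 42)

[<Fact>]
let ``blocking on Result can starve the pool`` () =
    // .Result parks a thread-pool thread until the task completes; if every
    // pool thread is parked like this, the continuations they are waiting on
    // have nowhere to run and the test run deadlocks.
    let value = simulatedRequest().Result
    Assert.Equal(42, value)

[<Fact>]
let ``awaiting keeps pool threads available`` () : Task =
    async {
        // let! releases the thread back to the pool while the task is pending.
        let! value = simulatedRequest() |> Async.AwaitTask
        Assert.Equal(42, value)
    } |> Async.StartAsTask :> Task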


Tratcher commented Oct 26, 2018

Cross-thread update: this is mostly resolved by the debugging/discussion at giraffe-fsharp/Giraffe#317. The tests were starving threads via Task.Result, and making them properly async addressed the issue. The TestHost 2.1 changes discussed in #1537 (comment) appear to have made TestHost more sensitive to thread starvation, likely by running more callbacks on the threadpool rather than inline. The only remaining questions are which of the two check-ins caused this and what background operation is stuck.

@Tratcher

Closing as there's no plan to pursue this.

@csurfleet

I just ran across the same issue, and switching everything to async/await fixed it. Cheers all!
