Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tests timing out with incomplete log #35752

Open
safern opened this issue May 2, 2020 · 11 comments
Open

Tests timing out with incomplete log #35752

safern opened this issue May 2, 2020 · 11 comments

Comments

@safern
Copy link
Member

safern commented May 2, 2020

  Discovering: System.Runtime.Extensions.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Runtime.Extensions.Tests (found 837 of 870 test cases)
  Starting:    System.Runtime.Extensions.Tests (parallel test collections = on, max threads = 2)
    System.Tests.EnvironmentTests.GetFolderPath_UWP_NotEmpty [SKIP]
      Condition(s) not met: "IsWindows10Version1709OrGreater", "IsInAppContainer"
    System.Tests.EnvironmentTests.GetFolderPath_UWP_ExistAndAccessible [SKIP]
      Condition(s) not met: "IsWindows10Version1709OrGreater", "IsInAppContainer"
    System.Tests.EnvironmentTests.UserInteractive_WindowsNano [SKIP]
      Condition(s) not met: "IsWindowsNanoServer"
   System.Runtime.Extensions.Tests: [Long Running Test] 'System.Tests.Environment_Exit.ExitCode_VoidMainAppReturnsSetValue', Elapsed: 00:02:28
   System.Runtime.Extensions.Tests: [Long Running Test] 'System.Tests.Environment_Exit.ExitCode_VoidMainAppReturnsSetValue', Elapsed: 00:04:28
   System.Runtime.Extensions.Tests: [Long Running Test] 'System.Tests.Environment_Exit.ExitCode_VoidMainAppReturnsSetValue', Elapsed: 00:06:28
   System.Runtime.Extensions.Tests: [Long Running Test] 'System.Tests.Environment_Exit.ExitCode_VoidMainAppReturnsSetValue', Elapsed: 00:08:28
   System.Runtime.Extensions.Tests: [Long Running Test] 'System.Tests.Environment_Exit.ExitCode_VoidMainAppReturnsSetValue', Elapsed: 00:10:28
   System.Runtime.Extensions.Tests: [Long Running Test] 'System.Tests.Environment_Exit.ExitCode_VoidMainAppReturnsSetValue', Elapsed: 00:12:28
   System.Runtime.Extensions.Tests: [Long Running Test] 'System.Tests.Environment_Exit.ExitCode_VoidMainAppReturnsSetValue', Elapsed: 00:14:28

https://dev.azure.com/dnceng/public/_build/results?buildId=627022&view=ms.vss-test-web.build-test-results-tab&runId=19596708&resultId=183795&paneView=debug

https://helix.dot.net/api/2019-06-17/jobs/6bdbf48d-5deb-436c-8f91-8f96823af464/workitems/System.Runtime.Extensions.Tests/console

Failed in:
#35169

Configuration: netcoreapp5.0-Windows_NT-Debug-x64-CoreCLR_release-Windows.7.Amd64

cc: @jkotas @stephentoub

@safern safern added the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label May 2, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Runtime untriaged New issue has not been triaged by the area owner labels May 2, 2020
@safern safern changed the title System.Tests.Environment_Exit.ExitCode_VoidMainAppReturnsSetValu hung in Windows7 x64 release coreclr System.Tests.Environment_Exit.ExitCode_VoidMainAppReturnsSetValue hung in Windows7 x64 release coreclr May 2, 2020
@jkotas
Copy link
Member

jkotas commented May 3, 2020

The log is incomplete. This is likely combination of #35451 and infrastructure problem (machine getting recycled or something similar).

@jkotas jkotas changed the title System.Tests.Environment_Exit.ExitCode_VoidMainAppReturnsSetValue hung in Windows7 x64 release coreclr Tests timing out with incomplete log May 3, 2020
@ghost
Copy link

ghost commented May 3, 2020

Tagging subscribers to this area: @safern, @ViktorHofer
Notify danmosemsft if you want to be subscribed.

@jkotas jkotas added blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' and removed blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' area-System.Runtime labels May 3, 2020
@safern
Copy link
Member Author

safern commented May 3, 2020

I think when the work item exceeds the helix timeout which is 15 mins helix kills the process, that’s why the log is incomplete, but @MattGal can confirm.

@MattGal
Copy link
Member

MattGal commented May 4, 2020

Yes. This is a timeout. I've had the conversation several times recently, and after investigation we determined that we would have to change the Helix API to make this more obvious, as adding a new enum would break clients.

If you just delete "/console" from that URI you get the other log:
https://helix.dot.net/api/2019-06-17/jobs/6bdbf48d-5deb-436c-8f91-8f96823af464/workitems/System.Runtime.Extensions.Tests --> this log

... which contains:

2020-05-02T01:27:10.589Z	INFO   	servicebusrepository(109)	renew_workitem_lock	Renewed work item lock. Status Code: 200
2020-05-02T01:27:53.402Z	INFO   	job(39)	kill	Begin killing timed-out process(es)
2020-05-02T01:27:53.402Z	ERROR  	job(48)	kill	Job running for too long. Killing...
2020-05-02T01:27:53.402Z	INFO   	job(54)	kill	Finished killing timed-out process(es)
2020-05-02T01:27:53.402Z	ERROR  	executor(652)	_execute_command	Executor timed out after 900 seconds and was killed.
2020-05-02T01:27:53.402Z	INFO   	event(42)	send	Sending event type WorkItemTimeout
2020-05-02T01:27:53.589Z	INFO   	saferequests(87)	request_with_retry	Response complete with status code '201'
2020-05-02T01:27:53.589Z	INFO   	executor(618)	_dump_file_upload	No dump files to upload
2020-05-02T01:27:53.589Z	INFO   	executor(674)	_execute_command	Finished _execute_command, exit code: -3
2020-05-02T01:27:53.589Z	INFO   	logs(153)	__exit__	Uploading self-uploading log file C:\h\logs\24c41cd4-1b79-41c4-9a13-e6dc6c0b89c8.log (Client: a002EXQ)

I sympathize with how this isn't handing you the information you need as directly as you'd like, but if you fix your hangs / slow tests and/or increase your timeouts this will go away.

@jkotas
Copy link
Member

jkotas commented May 4, 2020

Well, we cannot fix our hangs unless we know where they are. We need process dumps taken when things are killed due to timeouts like this.

@ViktorHofer
Copy link
Member

@MattGal @davidfowl I believe helix now allows to collect dumps on hangs, is that right?

@ViktorHofer ViktorHofer removed the untriaged New issue has not been triaged by the area owner label Jul 8, 2020
@MattGal
Copy link
Member

MattGal commented Jul 8, 2020

@MattGal @davidfowl I believe helix now allows to collect dumps on hangs, is that right?

Helix handles setting up its clients in ways that dumps get created and stored to a common folder, and uploading them alongside the results of a run. It does not handle making hang dumps directly because this is something the work item itself needs to have some special knowledge of, otherwise the dumps all end up being the entry point which is usually a script interpreter (cmd.exe / bash) and not at all useful for the person trying to debug the hang. This comes from experience of folks trying to get this sort of thing (e.g. "dumpling").

@davidfowl was working on a way for work items to do this within the context of the work item; I'll defer to him as to where we're at with that work.

@Anipik Anipik added this to the 5.0.0 milestone Jul 8, 2020
@ViktorHofer ViktorHofer self-assigned this Jul 30, 2020
@ViktorHofer
Copy link
Member

Will be fixed with #39923.

@ViktorHofer ViktorHofer modified the milestones: 5.0.0, 6.0.0 Aug 26, 2020
@am11
Copy link
Member

am11 commented Sep 25, 2020

I am getting another kind of timeout from coreclr Windows NT x64 Debug leg in #42583, where no test failure is shown in logs, but Helix SDK reports:

.packages\microsoft.dotnet.helix.sdk\5.0.0-beta.20471.1\tools\Microsoft.DotNet.Helix.Sdk.MultiQueue.targets(54,5): error MSB4181: (NETCORE_ENGINEERING_TELEMETRY=Build) The "WaitForHelixJobCompletion" task returned false but did not log an error.

and the executing DevOps agent reports:

##[error]The job running on agent NetCorePublic-Pool 95 ran longer than the maximum time of 150 minutes.

@ViktorHofer, will #39923 also cover these kind of timeouts to help getting info on the rootcause (info like, which test is taking forever)?

@ViktorHofer
Copy link
Member

ViktorHofer commented Sep 25, 2020

@ViktorHofer, will #39923 also cover these kind of timeouts to help getting info on the rootcause (info like, which test is taking forever)?

Yes, that PR will abort long running tests and create a dump. In this case, the timeout is likely an infra issue where not enough Helix clients are available to run the tests in time.

@ViktorHofer ViktorHofer modified the milestones: 6.0.0, 7.0.0 Aug 4, 2021
@krwq
Copy link
Member

krwq commented Sep 30, 2021

Is this still an issue? I'm seeing that #35451 is closed but #39923 which was supposed to fix the issue was not merged

@jakobbotsch jakobbotsch removed the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label May 3, 2022
@carlossanlop carlossanlop modified the milestones: 7.0.0, Future Jul 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants