Tests crashing in CI with no dump: exit code 137 means SIGKILL Killed #97049

Closed
jkotas opened this issue Jan 16, 2024 · 12 comments
Labels
area-ExceptionHandling-coreclr, blocking-clean-ci, Known Build Error

@jkotas
Member

jkotas commented Jan 16, 2024

Build Information

Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=527171
Build error leg or test failing: System.Text.Json.Tests
Pull request: #96894

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "exit code 137 means SIGKILL Killed",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}
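
As a rough illustration of how a known-issue entry like the one above could be checked against a console log, here is a minimal Python sketch. It assumes `ErrorMessage` is treated as a literal per-line substring and `ErrorPattern` as a regular expression; the real matcher's rules may differ, and `matches_known_issue` is a hypothetical helper, not part of any tooling.

```python
import json
import re

def matches_known_issue(issue_json: str, console_log: str) -> bool:
    """Return True if any log line matches the known-issue entry.

    Assumption: ErrorMessage is a literal substring, ErrorPattern a regex.
    """
    issue = json.loads(issue_json)
    message = issue.get("ErrorMessage", "")
    pattern = issue.get("ErrorPattern", "")
    lines = console_log.splitlines()
    if message:
        return any(message in line for line in lines)
    if pattern:
        return any(re.search(pattern, line) for line in lines)
    return False

ISSUE = """
{
  "ErrorMessage": "exit code 137 means SIGKILL Killed",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}
"""

# Log line taken from the failing work item quoted in this thread.
LOG = "exit code 137 means SIGKILL Killed eg by kill\n"

print(matches_known_issue(ISSUE, LOG))
```

Under these assumptions, a message this generic would match any work item that was killed with SIGKILL, whatever the underlying cause.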

Known issue validation

Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=527171
Error message validated: exit code 137 means SIGKILL Killed
Result validation: ✅ Known issue matched with the provided build.
Validation performed at: 1/25/2024 7:09:47 PM UTC

Report

Build Definition Test Pull Request
567012 dotnet/runtime System.IO.Net5Compat.Tests.WorkItemExecution #98463
566833 dotnet/runtime System.IO.Net5Compat.Tests.WorkItemExecution
566781 dotnet/runtime System.IO.Net5Compat.Tests.WorkItemExecution #98450
566756 dotnet/runtime System.IO.Net5Compat.Tests.WorkItemExecution #98451
566582 dotnet/runtime System.IO.Tests.WorkItemExecution
565830 dotnet/runtime System.Runtime.Tests.WorkItemExecution
565824 dotnet/runtime System.Runtime.Tests.WorkItemExecution
565832 dotnet/runtime System.Runtime.Tests.WorkItemExecution
565816 dotnet/runtime System.Runtime.Tests.WorkItemExecution
565833 dotnet/runtime System.Runtime.Tests.WorkItemExecution
565817 dotnet/runtime System.Runtime.Tests.WorkItemExecution
565820 dotnet/runtime System.Runtime.Tests.WorkItemExecution
565836 dotnet/runtime System.Runtime.Tests.WorkItemExecution
565839 dotnet/runtime System.Runtime.Tests.WorkItemExecution
565826 dotnet/runtime System.Runtime.Tests.WorkItemExecution
565841 dotnet/runtime System.Runtime.Tests.WorkItemExecution
565825 dotnet/runtime System.Runtime.Tests.WorkItemExecution
565821 dotnet/runtime System.Runtime.Tests.WorkItemExecution
565822 dotnet/runtime System.Runtime.Tests.WorkItemExecution
565516 dotnet/runtime System.Runtime.Tests.WorkItemExecution #97135
565131 dotnet/runtime System.Runtime.Tests.WorkItemExecution #97878
565109 dotnet/runtime System.IO.Tests.WorkItemExecution #98385
564673 dotnet/runtime System.Runtime.Tests.WorkItemExecution #97881
564516 dotnet/runtime System.IO.Tests.WorkItemExecution #98337
564229 dotnet/runtime System.Runtime.Tests.WorkItemExecution
564232 dotnet/runtime System.Runtime.Tests.WorkItemExecution
564237 dotnet/runtime System.Runtime.Tests.WorkItemExecution
564223 dotnet/runtime System.Runtime.Tests.WorkItemExecution
564230 dotnet/runtime System.Runtime.Tests.WorkItemExecution
564225 dotnet/runtime System.Runtime.Tests.WorkItemExecution
564214 dotnet/runtime System.Runtime.Tests.WorkItemExecution
564216 dotnet/runtime System.Runtime.Tests.WorkItemExecution
564210 dotnet/runtime System.Runtime.Tests.WorkItemExecution
564234 dotnet/runtime System.Runtime.Tests.WorkItemExecution
564235 dotnet/runtime System.Runtime.Tests.WorkItemExecution
564236 dotnet/runtime System.Runtime.Tests.WorkItemExecution
564217 dotnet/runtime System.Runtime.Tests.WorkItemExecution
564215 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562164 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98261
557869 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98134
562928 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98299
562011 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #96370
562893 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98303
562843 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98117
562846 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98281
562835 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98302
562830 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98301
562799 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98298
562788 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98199
562784 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98299
562774 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #97877
562752 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #97537
562684 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #97644
562536 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98277
562528 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98296
562540 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #97508
562456 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98226
562378 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562382 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562384 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562379 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562375 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562424 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98294
562428 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #95001
562381 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562376 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562372 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562412 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98293
562367 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562374 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562368 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562380 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562371 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562395 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98087
562365 dotnet/runtime System.Runtime.Tests.WorkItemExecution
562392 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98291
562353 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98289
562319 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #96650
562305 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #97079
562250 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98152
562222 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #97976
562110 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98284
562076 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98283
562028 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98281
562017 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98263
562020 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98273
562005 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98278
561996 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #97537
561892 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #97508
561889 dotnet/runtime System.Text.Json.Tests.WorkItemExecution #98277
561839 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561838 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561834 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561851 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561855 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561836 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561837 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561854 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561848 dotnet/runtime System.Runtime.Tests.WorkItemExecution
561847 dotnet/runtime System.Runtime.Tests.WorkItemExecution
Displaying 100 of 1066 results

Summary

24-Hour Hit Count: 19
7-Day Hit Count: 249
1-Month Count: 1066
@jkotas added the blocking-clean-ci and Known Build Error labels Jan 16, 2024
@ghost added the untriaged label Jan 16, 2024
@ghost

ghost commented Jan 16, 2024

Tagging subscribers to this area: @dotnet/area-infrastructure-libraries
See info in area-owners.md if you want to be subscribed.

Issue Details

Build Information

Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=527171
Build error leg or test failing: System.IO.Tests.File_GetSetTimes_SafeFileHandle.WritingShouldUpdateWriteTime_After_SetLastAccessTime
Pull request: #96894

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "exit code 137 means SIGKILL Killed eg by kill",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}
Author: jkotas
Assignees: -
Labels: area-Infrastructure-libraries, blocking-clean-ci, Known Build Error

Milestone: -

@jkotas
Member Author

jkotas commented Jan 16, 2024

  Discovering: System.Text.Json.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Text.Json.Tests (found 7275 of 7334 test cases)
  Starting:    System.Text.Json.Tests (parallel test collections = on, max threads = 6)
./RunTests.sh: line 180:  5589 Killed: 9               "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.Text.Json.Tests.runtimeconfig.json --depsfile System.Text.Json.Tests.deps.json xunit.console.dll System.Text.Json.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/private/tmp/helix/working/B0B70943/w/A9E50974/e
----- end Mon Jan 15 01:12:38 PST 2024 ----- exit code 137 ----------------------------------------------------------
exit code 137 means SIGKILL Killed eg by kill

We need dumps to make this diagnosable.

@agocke
Member

agocke commented Jan 16, 2024

SIGKILL is a pretty unusual way to take down the process... do we know if there's anything in the infra which can produce a SIGKILL?

@jkotas
Member Author

jkotas commented Jan 16, 2024

Exit code 137 can be caused by OOM.
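
For readers unfamiliar with the convention: a shell reports 128 plus the signal number when a child is terminated by a signal, so 137 corresponds to signal 9, which matches the "Killed: 9" line in the log above. A minimal Python sketch of that arithmetic follows; `describe_exit_code` is a hypothetical helper written for illustration only.

```python
import signal

def describe_exit_code(code: int) -> str:
    """Describe a shell-style exit status.

    By shell convention, values above 128 mean "terminated by signal (code - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The platform has no named signal for this number.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"

print(describe_exit_code(137))
```

On POSIX systems this reports SIGKILL. Because SIGKILL cannot be caught or handled by the target process, the process itself has no opportunity to write a dump when it is killed this way.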

@agocke
Member

agocke commented Jan 16, 2024

It doesn't look like we have a mechanism to grab dumps if it is OOM, though: #52521

@danmoseley changed the title from "Tests crashing in CI with no dump: exit code 137 means SIGKILL Killed eg by kill" to "Tests crashing in CI with no dump: exit code 137 means SIGKILL Killed" Jan 25, 2024
@ViktorHofer added this to the Future milestone Jan 29, 2024
@ViktorHofer removed the untriaged label Jan 29, 2024
@EgorBo
Member

EgorBo commented Feb 7, 2024

This seems to fail consistently on all PRs

@jkotas
Member Author

jkotas commented Feb 13, 2024

I was able to catch a live local repro and attach a debugger to it. There is one runaway thread with an extremely deep stack trace. All other threads are waiting for the GC suspension to finish.

The runaway thread keeps allocating memory at a very fast pace. You can see that by running the top command in a second shell. Once it allocates about 100GB, the process gets killed.

The repro is sensitive to timing. It stopped reproing for me if I added any kind of verbose logging.

(lldb) bt
* thread #20
  * frame #0: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
    frame #1: 0x0000000000000020
    frame #2: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
    frame #3: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
    frame #4: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
    frame #5: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
    frame #6: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
    frame #7: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
    frame #8: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
    frame #9: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
    frame #10: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
    frame #11: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
    frame #12: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
    frame #13: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
    frame #14: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
    frame #15: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
    frame #16: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
    frame #17: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
    frame #18: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
    frame #19: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
    frame #20: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
    frame #21: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
    frame #22: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
    frame #23: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
    frame #24: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
    frame #25: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
...
    frame #299994: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
    frame #299995: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
    frame #299996: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
    frame #299997: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
    frame #299998: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
    frame #299999: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
(lldb)
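
The backtrace above is visibly periodic: the same five frames repeat all the way down. A minimal Python sketch for spotting such a cycle in a saved `bt` dump is shown below. The regular expression is an assumption fitted to the frame lines quoted in this comment, not a general lldb parser, and `frame_symbols` and `shortest_cycle` are hypothetical helper names.

```python
import re

# Matches lines like: frame #4: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
FRAME_RE = re.compile(r"frame #\d+: 0x[0-9a-f]+ (\S+)`(.+?) \+ \d+")

def frame_symbols(backtrace: str) -> list:
    """Extract the symbol name from every frame line that has a module and symbol."""
    return [m.group(2) for m in FRAME_RE.finditer(backtrace)]

def shortest_cycle(symbols: list) -> list:
    """Return the shortest repeating unit of the symbol list, or [] if it is not periodic."""
    n = len(symbols)
    for period in range(1, n // 2 + 1):
        if all(symbols[i] == symbols[i + period] for i in range(n - period)):
            return symbols[:period]
    return []

# Frames #2 through #11 copied from the backtrace above.
SAMPLE = """
    frame #2: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
    frame #3: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
    frame #4: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
    frame #5: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
    frame #6: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
    frame #7: 0x000000010a0370c8 libcoreclr.dylib`SEHProcessException(PAL_SEHException*) + 408
    frame #8: 0x000000010a0a0def libcoreclr.dylib`PAL_DispatchExceptionInner(_CONTEXT*, _EXCEPTION_RECORD*) + 191
    frame #9: 0x000000010a0a0cf8 libcoreclr.dylib`PAL_DispatchException + 72
    frame #10: 0x000000010a0a0694 libcoreclr.dylib`PAL_DispatchExceptionWrapper + 10
    frame #11: 0x00007fff205944a8 libsystem_pthread.dylib`___chkstk_darwin + 96
"""

cycle = shortest_cycle(frame_symbols(SAMPLE))
for symbol in cycle:
    print(symbol)
```

Frames without a module and symbol, such as the bare address in frame #1, are skipped by the pattern and do not take part in the cycle check.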

@jkotas
Member Author

jkotas commented Feb 13, 2024

@janvorli Could you please take a look? It is hit by nearly all CI jobs and it looks related to your EH refactoring.

@janvorli
Member

I will take a look.

@janvorli
Member

@jkotas do you happen to know which of the tests in the suite was failing when you were able to repro it? I am currently trying to run the System.Text.Json.Tests in a loop locally on the current main.
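
Running a suite in a loop until it fails can be sketched as follows. This is a minimal Python illustration, not the harness used here: `run_until_failure` is a hypothetical helper, and the demo command is a stand-in for the real test invocation, which is the `dotnet exec ... xunit.console.dll` line from RunTests.sh quoted earlier in this thread.

```python
import subprocess
import sys
from typing import Optional

def run_until_failure(cmd: list, max_runs: int) -> Optional[tuple]:
    """Run cmd up to max_runs times.

    Returns (iteration, returncode) for the first non-zero exit, or None if
    every run succeeded. On POSIX, subprocess reports death by signal N as
    returncode -N, whereas a shell would report 128 + N.
    """
    for iteration in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return iteration, result.returncode
    return None

# Stand-in command that always exits with status 3 (placeholder for the real test run).
demo_cmd = [sys.executable, "-c", "import sys; sys.exit(3)"]
print(run_until_failure(demo_cmd, max_runs=5))
```

Since the repro is reported to be timing-sensitive and to disappear with verbose logging, the loop deliberately adds no extra output around each run.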

@jkotas
Member Author

jkotas commented Feb 13, 2024

System.Text.Json.Tests, debug build of libraries, checked build of the runtime, native x64 macOS. I was not able to repro it with the emulator on M1.

@jkotas
Member Author

jkotas commented Feb 15, 2024

There are multiple issues:

  • This type of crash does not produce dumps that would allow diagnosing it (known infra gap).
  • Stack overflow handling on macOS goes into an infinite memory allocation loop. @janvorli is looking into it. I have opened "Stackoverflow on macOS x64 hangs" (#98477) for it.
  • DeepNestedJsonFileTest in System.Text.Json.Tests consumes a lot of stack; the stack trace is several thousand frames deep. It hits stack overflow some of the time. The non-deterministic stack consumption of tiered compilation and GC makes the repro non-deterministic (cc @dotnet/area-system-text-json for awareness). This problem happened to be fixed a few days ago by "Change stack size on all desktop platforms to at least 1.5MB" (#98007), which made the stack size on macOS larger.
  • Unrelated crashes have sneaked in meanwhile because they had the same generic "exit code 137 means SIGKILL Killed" error message. New issues should be created for these as appropriate.

@jkotas closed this as completed Feb 15, 2024
@github-actions bot locked and limited conversation to collaborators Mar 16, 2024