Test Group Failure: System.Runtime.Tests outerloop #56567

Closed
josalem opened this issue Jul 29, 2021 · 13 comments · Fixed by #57108

@josalem
Contributor

josalem commented Jul 29, 2021

Noticed these failures when I was investigating some disabled tracing tests in #56507. These failures are unrelated to the tests I turned back on in that PR, so I looked at the history.

net6.0-Linux-Debug-x64-CoreCLR_release-Ubuntu.1804.Amd64.Open

/datadisks/disk1/work/B3F20994/w/C4E20A47/e /datadisks/disk1/work/B3F20994/w/C4E20A47/e
  Discovering: System.Runtime.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Runtime.Tests (found 28 of 6255 test cases)
  Starting:    System.Runtime.Tests (parallel test collections = on, max threads = 2)
./RunTests.sh: line 162: 11202 Killed                  "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.Runtime.Tests.runtimeconfig.json --depsfile System.Runtime.Tests.deps.json xunit.console.dll System.Runtime.Tests.dll -xml testResults.xml -nologo -nocolor -trait category=OuterLoop -notrait category=IgnoreForCI -notrait category=failing $RSP_FILE
/datadisks/disk1/work/B3F20994/w/C4E20A47/e
----- end Thu Jul 29 01:24:36 UTC 2021 ----- exit code 137 ----------------------------------------------------------
exit code 137 means SIGKILL Killed eg by kill
ulimit -c value: unlimited
[ 2439.914551] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 2439.914551] 251 total pagecache pages
[ 2439.914552] 0 pages in swap cache
[ 2439.914553] Swap cache stats: add 0, delete 0, find 0/0
[ 2439.914553] Free swap  = 0kB
[ 2439.914553] Total swap = 0kB
[ 2439.914554] 2097038 pages RAM
[ 2439.914554] 0 pages HighMem/MovableOnly
[ 2439.914555] 58679 pages reserved
[ 2439.914555] 0 pages cma reserved
[ 2439.914555] 0 pages hwpoisoned
[ 2439.914556] Tasks state (memory values in pages):
[ 2439.914556] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 2439.914560] [    447]     0   447    43216      215   331776        0             0 systemd-journal
[ 2439.914562] [    470]     0   470    24428       43    94208        0             0 lvmetad
[ 2439.914563] [    476]     0   476    11204      566   131072        0         -1000 systemd-udevd
[ 2439.914564] [    523]     0   523     3005      229    69632        0             0 hv_kvp_daemon
[ 2439.914565] [    896] 62583   896    35489      133   184320        0             0 systemd-timesyn
[ 2439.914566] [   1024]   100  1024    20021      151   176128        0             0 systemd-network
[ 2439.914567] [   1062]   101  1062    17697      173   176128        0             0 systemd-resolve
[ 2439.914569] [   1319]     0  1319    20058     3259   204800        0             0 python3
[ 2439.914570] [   1332]     0  1332    15545      168   155648        0             0 systemd-logind
[ 2439.914571] [   1333]     0  1333    42739     1957   229376        0             0 networkd-dispat
[ 2439.914572] [   1336]     0  1336    40270       32    86016        0             0 lxcfs
[ 2439.914573] [   1338]   103  1338    12514      160   143360        0          -900 dbus-daemon
[ 2439.914574] [   1366]     0  1366    72000      214   188416        0             0 accounts-daemon
[ 2439.914575] [   1372]     0  1372    27605       56   114688        0             0 irqbalance
[ 2439.914576] [   1381]     0  1381     7084       51    94208        0             0 atd
[ 2439.914577] [   1382]   102  1382    66817      364   163840        0             0 rsyslogd
[ 2439.914578] [   1391]     0  1391     7938       73    98304        0             0 cron
[ 2439.914579] [   1393]     0  1393   226267     6655   286720        0          -999 containerd
[ 2439.914580] [   1397]     0  1397     4104       38    73728        0             0 agetty
[ 2439.914581] [   1408]     0  1408     3723       32    69632        0             0 agetty
[ 2439.914582] [   1436]     0  1436    72221      197   200704        0             0 polkitd
[ 2439.914583] [   1622]     0  1622     1128       17    53248        0             0 none
[ 2439.914584] [   1785]     0  1785    18076      181   176128        0         -1000 sshd
[ 2439.914585] [   1806]     0  1806    96545     4082   266240        0             0 python3
[ 2439.914586] [   2473]  1000  2473     2899       66    65536        0             0 helix.sh
[ 2439.914588] [   2928]     0  2928   247469    11662   483328        0          -500 dockerd
[ 2439.914589] [   3295]  1000  3295    44341     6852   241664        0             0 python3
[ 2439.914590] [   3299]   106  3299     7150       46    94208        0             0 uuidd
[ 2439.914591] [   3313]  1000  3313    63593     7085   270336        0             0 python3
[ 2439.914592] [   3314]  1000  3314   124773    11968   348160        0             0 python3
[ 2439.914593] [  11190]  1000 11190     1158       16    57344        0             0 sh
[ 2439.914594] [  11192]  1000 11192     1158       17    57344        0             0 execute.sh
[ 2439.914595] [  11194]  1000 11194     2932       83    69632        0             0 bash
[ 2439.914596] [  11202]  1000 11202  2906815  1915794 15781888        0             0 dotnet
[ 2439.914597] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/helix.service,task=dotnet,pid=11202,uid=1000
[ 2439.914636] Out of memory: Killed process 11202 (dotnet) total-vm:11627260kB, anon-rss:7663176kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:15412kB oom_score_adj:0
[ 2440.040540] oom_reaper: reaped process 11202 (dotnet), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Waiting a few seconds for any dump to be written..
cat /proc/sys/kernel/core_pattern: /home/helixbot/dotnetbuild/dumps/core.%u.%p
cat /proc/sys/kernel/core_uses_pid: 0
cat: /proc/sys/kernel/coredump_filter: No such file or directory
cat /proc/sys/kernel/coredump_filter:
Looking around for any Linux dump..
... found no dump in /datadisks/disk1/work/B3F20994/w/C4E20A47/e
+ export _commandExitCode=137

and

net6.0-Linux-Debug-x64-CoreCLR_release-SLES.15.Amd64.Open

~/work/A42C0904/w/A0C3088C/e ~/work/A42C0904/w/A0C3088C/e
  Discovering: System.Runtime.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Runtime.Tests (found 28 of 6255 test cases)
  Starting:    System.Runtime.Tests (parallel test collections = on, max threads = 2)
./RunTests.sh: line 162: 19114 Killed                  "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.Runtime.Tests.runtimeconfig.json --depsfile System.Runtime.Tests.deps.json xunit.console.dll System.Runtime.Tests.dll -xml testResults.xml -nologo -nocolor -trait category=OuterLoop -notrait category=IgnoreForCI -notrait category=failing $RSP_FILE
~/work/A42C0904/w/A0C3088C/e
----- end Thu Jul 29 01:40:46 UTC 2021 ----- exit code 137 ----------------------------------------------------------
exit code 137 means SIGKILL Killed eg by kill
ulimit -c value: unlimited
dmesg: read kernel buffer failed: Operation not permitted
Waiting a few seconds for any dump to be written..
cat /proc/sys/kernel/core_pattern: /home/helixbot/dotnetbuild/dumps/core.%u.%p
cat /proc/sys/kernel/core_uses_pid: 0
cat: /proc/sys/kernel/coredump_filter: No such file or directory
cat /proc/sys/kernel/coredump_filter:
Looking around for any Linux dump..
... found no dump in /home/helixbot/work/A42C0904/w/A0C3088C/e

Both appear to be the same failure, with little to no other diagnostic information. I see a few other failures in the AzDO history going back to at least June 24th, and I saw failures all the way back into early May. The logs for those builds are gone, so I can't verify that they are the same failure. I stopped going back in the history at May, so I'm not sure how far back this failure goes.

Based on the history, this test looks potentially flaky: it routinely passes but occasionally fails. The failures seem to come in pairs, i.e., when one test run fails, there is another failure within a run of the other. All records of the test in AzDO have the exact same duration, 00:01:00.00, regardless of pass or fail, so I'm not sure how much I trust these records.

I couldn't find an issue tracking this, but feel free to dup if there is already one.
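For readers unfamiliar with the `exit code 137` line in both logs: shells report a process killed by a signal as 128 plus the signal number, so 137 corresponds to signal 9. A minimal sketch of that decoding follows; the helper name `describe_exit_code` is made up for illustration and is not part of the Helix scripts.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a shell exit status, treating values above 128 as 128 + signal number."""
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals maps a signal number to its symbolic name on this platform
            name = signal.Signals(signum).name
        except ValueError:
            name = f"unknown signal {signum}"
        return f"terminated by {name} (signal {signum})"
    return f"exited normally with status {code}"


if __name__ == "__main__":
    # 137 - 128 = 9, which is SIGKILL on Linux
    print(describe_exit_code(137))
```

The exit code alone does not say who sent the signal. In the Ubuntu log it is the `oom-kill` lines from dmesg that identify the kernel OOM killer as the sender.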

@josalem josalem added this to the 6.0.0 milestone Jul 29, 2021
@ghost

ghost commented Jul 29, 2021

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: josalem
Assignees: -
Labels:

area-System.Runtime

Milestone: 6.0.0

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Jul 29, 2021
@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Jul 31, 2021
@noahfalk noahfalk added the blocking-clean-ci-optional Blocking optional rolling runs label Aug 1, 2021
@noahfalk
Member

noahfalk commented Aug 1, 2021

@danmoseley
Member

Need to find out what was eating roughly 11 GB of virtual memory (about 7.3 GiB of it resident) in the tests/product: Killed process 11202 (dotnet) total-vm:11627260kB, anon-rss:7663176kB
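The sizes in the OOM killer message are easy to misread by an order of magnitude, so here is a small sketch that converts them. It assumes the kernel's `kB` means KiB and that pages are 4 KiB (the usual x86-64 default); the function names are illustrative only. The input strings and page counts are copied from the Ubuntu dmesg output in the issue description.

```python
import re

PAGE_KIB = 4  # assumption: 4 KiB pages, the common x86-64 default

OOM_LINE = (
    "Out of memory: Killed process 11202 (dotnet) total-vm:11627260kB, "
    "anon-rss:7663176kB, file-rss:0kB, shmem-rss:0kB, UID:1000 "
    "pgtables:15412kB oom_score_adj:0"
)


def parse_oom_line(line: str) -> dict:
    """Collect every 'name:<number>kB' field of an OOM killer message into a dict of ints."""
    return {name: int(value) for name, value in re.findall(r"([\w-]+):(\d+)kB", line)}


def kib_to_gib(kib: int) -> float:
    """Convert KiB to GiB."""
    return kib / (1024 * 1024)


if __name__ == "__main__":
    fields = parse_oom_line(OOM_LINE)
    print(f"virtual:  {kib_to_gib(fields['total-vm']):.2f} GiB")
    print(f"resident: {kib_to_gib(fields['anon-rss']):.2f} GiB")
    # "2097038 pages RAM" from the same dmesg dump, converted with the assumed page size
    print(f"machine RAM: {kib_to_gib(2097038 * PAGE_KIB):.2f} GiB")
```

Under those assumptions the process had about 7.3 GiB resident on a machine with about 8 GiB of RAM and no swap (`Total swap = 0kB`), which is consistent with a global OOM kill.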

@danmoseley
Member

Interestingly, 100% of these SIGKILLs of this test library are on Ubuntu 18.04 and SLES 15. Could they have less memory or a different config?

Next step: either try to repro locally, or perhaps fix #55702 so that we get a dump.

https://engsrvprod.kusto.windows.net/engineeringdata

WorkItems 
| where Started > now(-30d)  
| where FriendlyName == "System.Runtime.Tests"
| where ExitCode == 137 //or ExitCode  == 0
| join kind= inner (
   Jobs  | where Started > now(-30d) | project  QueueName , JobId, Build, Type, Source,
    Branch,
  Pipeline = tostring(parse_json(Properties).DefinitionName),
  Pipeline_Configuration = tostring(parse_json(Properties).configuration),
  OS = QueueName,
  Arch = tostring(parse_json(Properties).architecture)
) on JobId
| where Branch  !startswith "refs/pull"
| summarize count() by ExitCode, QueueName, Branch, Pipeline, Pipeline_Configuration, OS, Arch
| order by count_ desc
| ExitCode | QueueName | Branch | Pipeline | Pipeline_Configuration | OS | Arch | count_ |
|---|---|---|---|---|---|---|---|
| 137 | sles.15.amd64.open.rt | refs/heads/main | runtime-libraries-coreclr outerloop-linux | Release | sles.15.amd64.open.rt | x64 | 30 |
| 137 | ubuntu.1804.amd64.open.rt | refs/heads/main | runtime-libraries-coreclr outerloop-linux | Release | ubuntu.1804.amd64.open.rt | x64 | 30 |
| 137 | ubuntu.1804.amd64.open.rt | refs/heads/main | runtime-libraries-coreclr outerloop | Release | ubuntu.1804.amd64.open.rt | x64 | 29 |
| 137 | sles.15.amd64.open.rt | refs/heads/main | runtime-libraries-coreclr outerloop | Release | sles.15.amd64.open.rt | x64 | 29 |
| 137 | ubuntu.1804.amd64.open.svc | refs/heads/release/6.0-preview7 | runtime-libraries-coreclr outerloop-linux | Release | ubuntu.1804.amd64.open.svc | x64 | 25 |
| 137 | ubuntu.1804.amd64.open.svc | refs/heads/release/6.0-preview7 | runtime-libraries-coreclr outerloop | Release | ubuntu.1804.amd64.open.svc | x64 | 25 |
| 137 | sles.15.amd64.open.svc | refs/heads/release/6.0-preview7 | runtime-libraries-coreclr outerloop-linux | Release | sles.15.amd64.open.svc | x64 | 25 |
| 137 | sles.15.amd64.open.svc | refs/heads/release/6.0-preview7 | runtime-libraries-coreclr outerloop | Release | sles.15.amd64.open.svc | x64 | 25 |
| 137 | sles.15.amd64.open.svc | refs/heads/release/6.0-preview6 | runtime-libraries-coreclr outerloop-linux | Release | sles.15.amd64.open.svc | x64 | 24 |
| 137 | ubuntu.1804.amd64.open.svc | refs/heads/release/6.0-preview6 | runtime-libraries-coreclr outerloop | Release | ubuntu.1804.amd64.open.svc | x64 | 24 |
| 137 | ubuntu.1804.amd64.open.svc | refs/heads/release/6.0-preview5 | runtime-libraries-coreclr outerloop-linux | Release | ubuntu.1804.amd64.open.svc | x64 | 24 |
| 137 | ubuntu.1804.amd64.open.svc | refs/heads/release/6.0-preview5 | runtime-libraries-coreclr outerloop | Release | ubuntu.1804.amd64.open.svc | x64 | 24 |
| 137 | sles.15.amd64.open.svc | refs/heads/release/6.0-preview6 | runtime-libraries-coreclr outerloop | Release | sles.15.amd64.open.svc | x64 | 24 |
| 137 | ubuntu.1804.amd64.open.svc | refs/heads/release/6.0-preview6 | runtime-libraries-coreclr outerloop-linux | Release | ubuntu.1804.amd64.open.svc | x64 | 24 |
| 137 | sles.15.amd64.open.svc | refs/heads/release/6.0-preview5 | runtime-libraries-coreclr outerloop | Release | sles.15.amd64.open.svc | x64 | 24 |
| 137 | sles.15.amd64.open.svc | refs/heads/release/6.0-preview5 | runtime-libraries-coreclr outerloop-linux | Release | sles.15.amd64.open.svc | x64 | 24 |
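As a quick cross-check of the query result, the `count_` column can be tallied per distro and per branch. This is only a sketch over the sixteen rows shown above, with each row reduced to (queue, branch, count); it does not query Kusto.

```python
from collections import Counter

# (QueueName, Branch, count_) for each row of the result above, in order
ROWS = [
    ("sles.15.amd64.open.rt", "main", 30),
    ("ubuntu.1804.amd64.open.rt", "main", 30),
    ("ubuntu.1804.amd64.open.rt", "main", 29),
    ("sles.15.amd64.open.rt", "main", 29),
    ("ubuntu.1804.amd64.open.svc", "release/6.0-preview7", 25),
    ("ubuntu.1804.amd64.open.svc", "release/6.0-preview7", 25),
    ("sles.15.amd64.open.svc", "release/6.0-preview7", 25),
    ("sles.15.amd64.open.svc", "release/6.0-preview7", 25),
    ("sles.15.amd64.open.svc", "release/6.0-preview6", 24),
    ("ubuntu.1804.amd64.open.svc", "release/6.0-preview6", 24),
    ("ubuntu.1804.amd64.open.svc", "release/6.0-preview5", 24),
    ("ubuntu.1804.amd64.open.svc", "release/6.0-preview5", 24),
    ("sles.15.amd64.open.svc", "release/6.0-preview6", 24),
    ("ubuntu.1804.amd64.open.svc", "release/6.0-preview6", 24),
    ("sles.15.amd64.open.svc", "release/6.0-preview5", 24),
    ("sles.15.amd64.open.svc", "release/6.0-preview5", 24),
]


def tally(rows):
    """Sum failure counts by distro (first dotted component of the queue name) and by branch."""
    by_distro, by_branch = Counter(), Counter()
    for queue, branch, count in rows:
        by_distro[queue.split(".")[0]] += count
        by_branch[branch] += count
    return by_distro, by_branch


if __name__ == "__main__":
    by_distro, by_branch = tally(ROWS)
    print("total:", sum(by_distro.values()))
    print("by distro:", dict(by_distro))
    print("by branch:", dict(by_branch))
```

The failures split evenly between the SLES and Ubuntu queues, which fits the observation that they tend to show up in pairs.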

@danmoseley danmoseley modified the milestones: 6.0.0, 7.0.0 Aug 9, 2021
@danmoseley
Member

Moving to 7 as this isn't a ship blocker, but it's important that our tests don't crash so we should investigate a little later.

@danmoseley
Member

Incidentally, dumping the FinishedDate column shows this failed 410 times in the last 30 days across main and the preview branches. That's not good. It's probably one badly behaving test, presumably an outerloop test per the table.

We haven't added one of those since April:

C:\git\runtime>git lg -SOuterLoop src/libraries/System.Runtime/tests/**
...
* fd2e5643646 - Add internal Array.Clear method (#51548) (Wed Apr 21 15:38:03 2021 -0700) <Levi Broderick>
...
* 8c0d7c1ebc5 - Optimize the Linguistic String Search Operations (#43065) (Wed Oct 14 06:46:37 2020 -0700) <Tarek Mahmoud Sayed>
...
* c0ddd1c5d16 - Adding public API for Pinned Object Heap allocations (#33526) (Wed Mar 18 05:00:41 2020 +0000) <Vladimir Sadov>
...
* 6b904c0d885 - Add Array.Copy test for very large arrays (dotnet/corefx#42373) (Mon Nov 4 18:05:27 2019 -0500) <Jan Kotas>

Not sure we can go back further in history in the test failures.

@danmoseley
Member

I take that back -- it started on April 22!

WorkItems 
| where FriendlyName == "System.Runtime.Tests"
| where ExitCode == 137 //or ExitCode  == 0
| join kind= inner (
   Jobs  | project  QueueName , JobId, Build, Type, Source,
    Branch,
  Pipeline = tostring(parse_json(Properties).DefinitionName),
  Pipeline_Configuration = tostring(parse_json(Properties).configuration),
  OS = QueueName,
  Arch = tostring(parse_json(Properties).architecture)
) on JobId
| where Branch  !startswith "refs/pull"
| summarize count() by ExitCode, QueueName, Branch, Pipeline, Pipeline_Configuration, OS, Arch, bin(Finished, 1d)
| order by Finished asc
| take 10
| ExitCode | QueueName | Branch | Pipeline | Pipeline_Configuration | OS | Arch | Finished | count_ |
|---|---|---|---|---|---|---|---|---|
| 137 | sles.15.amd64.open.rt | refs/heads/main | runtime-libraries-coreclr outerloop-linux | Release | sles.15.amd64.open.rt | x64 | 2021-04-22 00:00:00.0000000 | 1 |
| 137 | ubuntu.1804.amd64.open.rt | refs/heads/main | runtime-libraries-coreclr outerloop | Release | ubuntu.1804.amd64.open.rt | x64 | 2021-04-22 00:00:00.0000000 | 1 |
| 137 | sles.15.amd64.open.rt | refs/heads/main | runtime-libraries-coreclr outerloop | Release | sles.15.amd64.open.rt | x64 | 2021-04-22 00:00:00.0000000 | 1 |
| 137 | ubuntu.1804.amd64.open.rt | refs/heads/main | runtime-libraries-coreclr outerloop-linux | Release | ubuntu.1804.amd64.open.rt | x64 | 2021-04-22 00:00:00.0000000 | 1 |
| 137 | sles.15.amd64.open.rt | refs/heads/main | runtime-libraries-coreclr outerloop | Release | sles.15.amd64.open.rt | x64 | 2021-04-23 00:00:00.0000000 | 1 |
| 137 | ubuntu.1804.amd64.open.rt | refs/heads/main | runtime-libraries-coreclr outerloop-linux | Release | ubuntu.1804.amd64.open.rt | x64 | 2021-04-23 00:00:00.0000000 | 1 |
| 137 | ubuntu.1804.amd64.open.rt | refs/heads/main | runtime-libraries-coreclr outerloop | Release | ubuntu.1804.amd64.open.rt | x64 | 2021-04-23 00:00:00.0000000 | 1 |
| 137 | sles.15.amd64.open.rt | refs/heads/main | runtime-libraries-coreclr outerloop-linux | Release | sles.15.amd64.open.rt | x64 | 2021-04-23 00:00:00.0000000 | 1 |
| 137 | ubuntu.1804.amd64.open.rt | refs/heads/main | runtime-libraries-coreclr outerloop-linux | Release | ubuntu.1804.amd64.open.rt | x64 | 2021-04-24 00:00:00.0000000 | 1 |
| 137 | sles.15.amd64.open.rt | refs/heads/main | runtime-libraries-coreclr outerloop-linux | Release | sles.15.amd64.open.rt | x64 | 2021-04-24 00:00:00.0000000 | 1 |

So very likely caused by https://github.com/dotnet/runtime/pull/51548/files
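The reasoning behind that attribution is a simple date correlation: take the earliest day with a 137 exit code and look for the most recent OuterLoop-related commit on or before it. A minimal sketch using only the commits from the git log output above and the first failure date from the query (commit dates are the local dates as printed, not normalized to UTC):

```python
from datetime import date

# (short hash, commit date as printed by git, subject) from the git log output above
COMMITS = [
    ("fd2e5643646", date(2021, 4, 21), "Add internal Array.Clear method (#51548)"),
    ("8c0d7c1ebc5", date(2020, 10, 14), "Optimize the Linguistic String Search Operations (#43065)"),
    ("c0ddd1c5d16", date(2020, 3, 18), "Adding public API for Pinned Object Heap allocations (#33526)"),
    ("6b904c0d885", date(2019, 11, 4), "Add Array.Copy test for very large arrays (dotnet/corefx#42373)"),
]

FIRST_FAILURE = date(2021, 4, 22)  # earliest Finished bin in the query result


def latest_commit_on_or_before(commits, day):
    """Return the most recent commit dated on or before `day`, or None if there is none."""
    candidates = [c for c in commits if c[1] <= day]
    return max(candidates, key=lambda c: c[1]) if candidates else None


if __name__ == "__main__":
    sha, when, subject = latest_commit_on_or_before(COMMITS, FIRST_FAILURE)
    print(f"{sha} {when} {subject}")
    print("days before first failure:", (FIRST_FAILURE - when).days)
```

A one-day gap between the commit and the first recorded failure is suggestive rather than conclusive, since the query only reaches as far back as the retained data.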

We can mark the tests to skip Ubuntu and SLES. They are unlikely to have OS-specific bugs, and an OOM killer termination doesn't by itself indicate a product bug.

@pgovind
Contributor

pgovind commented Aug 10, 2021

Tagging @GrabYourPitchforks for visibility (I was just triaging the label)

@danmoseley
Member

I'll skip them on these OSes.
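To illustrate what an OS-based skip decision involves, here is a language-neutral sketch in Python that reads `/etc/os-release`-style text and decides whether the large-array tests should be skipped. Everything in it (`SKIPPED`, `should_skip`, the sample text) is hypothetical and is not the runtime repo's test infrastructure; the actual change may look quite different.

```python
# Hypothetical skip list: (ID, VERSION_ID prefix) pairs for the two affected queues
SKIPPED = {("ubuntu", "18.04"), ("sles", "15")}


def parse_os_release(text: str) -> dict:
    """Parse KEY=value lines in the os-release format, dropping surrounding quotes."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info


def should_skip(os_release_text: str) -> bool:
    """True when the distro ID and version match an entry in SKIPPED."""
    info = parse_os_release(os_release_text)
    distro = info.get("ID", "").lower()
    version = info.get("VERSION_ID", "")
    return any(
        distro == d and (version == v or version.startswith(v + "."))
        for d, v in SKIPPED
    )


if __name__ == "__main__":
    sample = 'NAME="Ubuntu"\nID=ubuntu\nVERSION_ID="18.04"\n'
    print(should_skip(sample))
```

Matching on a version prefix lets a service pack such as `15.2` count as SLES 15 without also matching unrelated versions like `150`.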

@GrabYourPitchforks
Member

Interesting. We did add an outerloop test as part of that PR (see here), but it follows the same pattern that Array.Copy already did. Was there any known flakiness in that test prior to this PR?

@danmoseley
Member

Not that I see -- not an OOM anyway. Could it be that occasionally the GC does not reclaim the 1GB from the first test by the time the second one tries to allocate?

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Aug 10, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Aug 11, 2021
@GrabYourPitchforks
Member

I wonder if it's a memory fragmentation issue. There's enough memory available, but not always as a contiguous block, so things fall over. And having the two tests run one after another exacerbates the fragmentation.

@danmoseley
Member

danmoseley commented Aug 11, 2021

@maonis here we have two tests, run immediately one after another, that each allocate a 1GB array and then let it go out of scope. This fails periodically on Linux, and only on SLES and Ubuntu. On those the OOM killer terminates the process (with about 7.3 GiB resident and about 11 GiB virtual, per the message).

This did not happen when there was one such test, but only when Levi added a second such test that runs directly after.

There's no product bug here, just curious whether you can shed light on why that might happen when the machine presumably has significantly more memory. And whether you are aware of varying oom killer behaviors between distros.

@ghost ghost locked as resolved and limited conversation to collaborators Sep 10, 2021