Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failure: JIT\\Methodical\\tailcall_v4\\hijacking\\hijacking.cmd #40916

Closed
v-haren opened this issue Aug 17, 2020 · 5 comments · Fixed by #41340
Closed

Test failure: JIT\\Methodical\\tailcall_v4\\hijacking\\hijacking.cmd #40916

v-haren opened this issue Aug 17, 2020 · 5 comments · Fixed by #41340
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms'
Milestone

Comments

@v-haren
Copy link

v-haren commented Aug 17, 2020

failed in job: runtime-coreclr outerloop 20200816.2

Error message

cmdLine:C:\h\w\C1360A0E\w\BF880A6F\e\JIT\Methodical\tailcall_v4\hijacking\hijacking.cmd Timed Out (timeout in milliseconds: 600000 from variable __TestTimeout, start: 8/17/2020 2:17:27 AM, end: 8/17/2020 2:27:27 AM)

Return code:      -100
Raw output file:      C:\h\w\C1360A0E\w\BF880A6F\e\JIT\Methodical\Reports\JIT.Methodical\tailcall_v4\hijacking\hijacking.output.txt
Raw output:
BEGIN EXECUTION
        1 file(s) copied.
" "C:\h\w\C1360A0E\p\crossgen.exe"  /Platform_Assemblies_Paths C:\h\w\C1360A0E\p;C:\h\w\C1360A0E\w\BF880A6F\e\JIT\Methodical\tailcall_v4\hijacking /in C:\h\w\C1360A0E\w\BF880A6F\e\JIT\Methodical\tailcall_v4\hijacking\hijacking.org /out C:\h\w\C1360A0E\w\BF880A6F\e\JIT\Methodical\tailcall_v4\hijacking\\hijacking.dll"
Microsoft (R) CoreCLR Native Image Generator - Version 5.0.0-ci
Copyright (c) Microsoft Corporation.  All rights reserved.

Native image C:\h\w\C1360A0E\w\BF880A6F\e\JIT\Methodical\tailcall_v4\hijacking\\hijacking.dll generated successfully.
 "C:\h\w\C1360A0E\p\corerun.exe" hijacking.dll 

cmdLine:C:\h\w\C1360A0E\w\BF880A6F\e\JIT\Methodical\tailcall_v4\hijacking\hijacking.cmd Timed Out (timeout in milliseconds: 600000 from variable __TestTimeout, start: 8/17/2020 2:17:27 AM, end: 8/17/2020 2:27:27 AM)
Test Harness Exitcode is : -100
To run the test:
> set CORE_ROOT=C:\h\w\C1360A0E\p
> C:\h\w\C1360A0E\w\BF880A6F\e\JIT\Methodical\tailcall_v4\hijacking\hijacking.cmd
Expected: True
Actual:   False


Stack trace
   at JIT_Methodical._tailcall_v4_hijacking_hijacking_._tailcall_v4_hijacking_hijacking_cmd() in F:\workspace\_work\1\s\artifacts\tests\coreclr\Windows_NT.x64.Checked\TestWrappers\JIT.Methodical\JIT.Methodical.XUnitWrapper.cs:line 182791
@v-haren v-haren added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 17, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Aug 17, 2020
@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Aug 17, 2020
@JulieLeeMSFT JulieLeeMSFT added this to the 5.0.0 milestone Aug 17, 2020
@jkotas
Copy link
Member

jkotas commented Aug 19, 2020

Failed in #41033 (regular CI run):

:\h\w\A8010978\w\A3EC0928\e>dotnet C:\h\w\A8010978\p\xunit.console.dll JIT\Methodical\JIT.Methodical.XUnitWrapper.dll -parallel collections -nocolor -noshadow -xml testResults.xml -trait TestGroup=JIT.Methodical 
Microsoft.DotNet.XUnitConsoleRunner v2.5.0 (64-bit .NET 5.0.0-preview.8.20361.2)
  Discovering: JIT.Methodical.XUnitWrapper
  Discovered:  JIT.Methodical.XUnitWrapper
  Starting:    JIT.Methodical.XUnitWrapper
Attempting to collect crash dump: C:\cores\crashdump_5076.dmp
Invoking: C:\h\w\A8010978\p\createdump.exe --name "C:\cores\crashdump_5076.dmp" 5076 --withheap
createdump stdout:
Writing minidump with heap to file C:\cores\crashdump_5076.dmp
Dump successfully written

createdump stderr:

Collected crash dump: C:\cores\crashdump_5076.dmp
    JIT\Methodical\tailcall_v4\hijacking\hijacking.cmd [FAIL]

@AndyAyersMS
Copy link
Member

Failure is a timeout after 10 minutes. Looking at recent history for this configuration (x64 R24 NoTC) runs take from 90s - 600s (+); the only other configuration I see with failures is (x64 NoTC). Arm64 completes these in 30s or so.

@jakobbotsch
Copy link
Member

jakobbotsch commented Aug 25, 2020

C# of this test is:

using System;
using System.Threading;

internal struct ForceHelper
{
  public unsafe ForceHelper* me1;
  public unsafe ForceHelper* me2;
  public long counter;
}

internal class Repro
{
  private volatile bool stop;
  private bool failed;
  private long counts;
  private long gc;

  private unsafe void TailHelper1(ForceHelper h)
  {
    ++h.counter;
    if (this.stop)
    {
      Interlocked.Add(ref this.counts, h.counter);
    }
    else
    {
      if (&h != h.me1)
      {
        if (h.counter > 3L)
        {
          this.failed = true;
          this.stop = true;
          Interlocked.Add(ref this.counts, h.counter);
          return;
        }
        h.me1 = &h;
      }
      /* tail. */ this.TailHelper2(h, h);
    }
  }

  private unsafe void TailHelper2(ForceHelper h, ForceHelper h2)
  {
    ++h.counter;
    if (this.stop)
    {
      Interlocked.Add(ref this.counts, h.counter);
    }
    else
    {
      if (&h != h.me2)
      {
        if (h.counter > 3L)
        {
          this.failed = true;
          this.stop = true;
          Interlocked.Add(ref this.counts, h.counter);
          return;
        }
        h.me2 = &h;
      }
      /* tail. */ this.TailHelper1(h);
    }
  }

  private void TailCallThread()
  {
    /* tail. */ this.TailHelper1(new ForceHelper());
  }

  private void Collector()
  {
    while (!this.stop)
    {
      GC.Collect();
      ++this.gc;
    }
  }

  private static int Main()
  {
    Repro repro = new Repro();
    Thread[] threadArray = new Thread[2]
    {
      new Thread(new ThreadStart(repro.Collector)),
      new Thread(new ThreadStart(repro.TailCallThread))
    };
    foreach (Thread thread in threadArray)
      thread.Start();
    for (int index = 0; index < 30; ++index)
    {
      Thread.Sleep(1000);
      if (repro.stop)
        break;
    }
    repro.stop = true;
    foreach (Thread thread in threadArray)
      thread.Join();
    Console.WriteLine("Executed {0} GCs", (object) repro.gc);
    if (repro.failed)
    {
      Console.WriteLine("Executed {0} tail calls before failing", (object) repro.counts);
      return 101;
    }
    Console.WriteLine("Executed {0} tail calls", (object) repro.counts);
    Console.WriteLine("PASS");
    return 100;
  }
}

So it has two threads using full cores, one doing tailcalls and one calling GC.Collect() in a loop. A third thread invokes Thread.Sleep(1000) 30 times before stopping, so the test should not take much longer than 30 seconds unless this thread is basically never allowed to run.

@jkotas
Copy link
Member

jkotas commented Aug 25, 2020

one calling GC.Collect() in a loop.

A loop like this can pretty much prevent all other threads in the process from running if there are other things running on the machine like it is the case in the test run setup. Something like:

  1. GC is running on one core, all other test threads are stopped for GC, other cores are running other tests
  2. GC is finished running, GC loop thread continues executing, other test threads are eligible to run but there is no free core to run them so the scheduler decides to wait for the next time slice tick to schedule them
  3. GC starts running immediately again, all other threads are stopped for GC without making any progress since last one

It can be taking minutes just to get through the initial test setup phase. Maybe the GC.Collect loop should have Thread.Sleep(1) in it?

@jakobbotsch
Copy link
Member

It looks like the most recent relevant change was by myself: fdaf97e#diff-ad29682edcb1c578b9d43b941c66b861

There I noticed that the test was not using helpers on x64, and I changed it to use helpers (the old unwind-based helpers). It looks like I ran into timeouts after this because I also reduced the intensity of the test a lot (but not touching the GC.Collect loop which stayed from previously).

one calling GC.Collect() in a loop.

A loop like this can pretty much prevent all other threads in the process from running if there are other things running on the machine like it is the case in the test run setup. Something like:

  1. GC is running on one core, all other test threads are stopped for GC, other cores are running other tests
  2. GC is finished running, GC loop thread continues executing, other test threads are eligible to run but there is no free core to run them so the scheduler decides to wait for the next time slice tick to schedule them
  3. GC starts running immediately again, all other threads are stopped for GC without making any progress since last one

It can be taking minutes just to get through the initial test setup phase. Maybe the GC.Collect loop should have Thread.Sleep(1) in it?

Thanks for the explanation, it makes perfect sense -- if there are few available threads then the GC thread has to be preempted between two GCs, which is pretty unlikely. The fix sounds fine to me but I do wonder how it affects the effectivity of this regression test (in any case, the fix seems necessary).

@jkotas jkotas added the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Aug 25, 2020
jakobbotsch added a commit to jakobbotsch/runtime that referenced this issue Aug 25, 2020
Calling GC.Collect in a tight loop is unnecessary for the test and
causing timeouts in CI.

Fixes dotnet#40916
jkotas pushed a commit that referenced this issue Aug 25, 2020
Calling GC.Collect in a tight loop is unnecessary for the test and
causing timeouts in CI.

Fixes #40916
github-actions bot pushed a commit that referenced this issue Aug 25, 2020
Calling GC.Collect in a tight loop is unnecessary for the test and
causing timeouts in CI.

Fixes #40916
jkotas pushed a commit that referenced this issue Aug 26, 2020
Calling GC.Collect in a tight loop is unnecessary for the test and
causing timeouts in CI.

Fixes #40916

Co-authored-by: Jakob Botsch Nielsen <Jakob.botsch.nielsen@gmail.com>
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms'
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants