
The frustrating state of tail calls in .NET #2191

Closed
mrange opened this issue Jan 25, 2020 · 22 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@mrange

mrange commented Jan 25, 2020

The frustrating state of tail calls in .NET

I listened to the Microsoft Language & Runtime Community standup on Twitch
where they said they like "aggressive" well-informed feedback from the community,
so I figured I'd bring up something that has been rubbing me the wrong way for a
long time in .NET: tail call performance.

I am sure this has been discussed before, but I am going to add more fuel to the debate.

Tail calls in .NET

In .NET it's possible to annotate calls with a .tail prefix, allowing the
jitter to eliminate stack frames and thus avoid running out of stack space.

If you are a C# developer you have probably never encountered tail calls, as the
C# compiler as of now (AFAIK?) doesn't emit .tail prefixes (might change with tail return?).

It's different for F# developers though, where the compiler does emit .tail
prefixes.
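A minimal sketch (my example, not from the runtime docs) of where the prefix shows up:

```fsharp
// A direct, self-recursive tail call like this one the F# compiler can
// turn into a plain loop; no tail. prefix is needed at all:
let rec countDown (n : int) : int =
    if n <= 0 then 0 else countDown (n - 1)

// But a tail call through a function value cannot become a loop. Here F#
// (with tailcalls enabled, the default for Release builds) emits roughly
//   tail. callvirt FSharpFunc`2<int32, int32>::Invoke
// so the caller's frame can be discarded before the call.
let callLast (f : int -> int) (v : int) : int =
    f v
```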

The problem

Avoiding running out of stack space is a good thing, so what's the problem?

The problem is the abysmal performance of tail calls in the general case.

Let me demonstrate by implementing a simple data pipeline based on push rather than pull (IEnumerable<_> is pull based).

// Minimalistic PushStream
//  A PushStream accepts a receiver function that will be called
//  with each value in the PushStream
type 'T PushStream = ('T -> unit) -> unit

module PushStream =
  let inline zero      ()       = LanguagePrimitives.GenericZero
  let inline push      r v      = r v

  // Creates a PushStream with all integers from b to e (inclusive)
  let inline fromRange b e    r = for i = b to e do push r i
  // Maps all values in ps using mapping function f
  let inline map       f   ps r = ps (fun v -> push r (f v))
  // Filters all values in ps using filter function f
  let inline filter    f   ps r = ps (fun v -> if f v then push r v)
  // Sums all values in ps
  let inline sum           ps   = let mutable s = zero () in ps (fun v -> s <- s + v); s

For a simple data pipeline designed to expose the overhead of the pipeline itself, the push stream shows some promising performance when compared to LINQ.

// Uses BenchmarkDotNet
type Benchmarks () =
  [<Params (10000, 1000, 100)>] 
  member val public Count = 100 with get, set

  [<Benchmark>]
  member x.SimpleImperativeTest () =
    let mutable i   = x.Count
    let mutable sum = 0L
    while i >= 0 do
      let v =  int64 i
      i     <- i - 1
      if (v &&& 1L) = 0L then
        sum <- sum + (v + 1L)
    sum

  [<Benchmark>]
  member x.SimpleLinqTest () =
    Enumerable.Range(0, x.Count)
      .Select(int64)
      .Where(fun v -> (v &&& 1L) = 0L)
      .Select((+) 1L)
      .Sum()

  [<Benchmark>]
  member x.SimplePushStreamTest () =
    PushStream.fromRange  0 x.Count
    |> PushStream.map     int64
    |> PushStream.filter  (fun v -> (v &&& 1L) = 0L)
    |> PushStream.map     ((+) 1L)
    |> PushStream.sum
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i5-3570K CPU 3.40GHz (Ivy Bridge), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.1.100
  [Host]     : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT DEBUG
  DefaultJob : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT
|               Method | Count |          Mean |      Error |     StdDev |
|--------------------- |------ |--------------:|-----------:|-----------:|
| SimpleImperativeTest |   100 |      73.36 ns |   0.259 ns |   0.243 ns |
|       SimpleLinqTest |   100 |   1,550.18 ns |   7.837 ns |   7.331 ns |
| SimplePushStreamTest |   100 |     419.71 ns |   1.768 ns |   1.654 ns |
| SimpleImperativeTest |  1000 |     611.51 ns |   2.081 ns |   1.946 ns |
|       SimpleLinqTest |  1000 |  13,677.24 ns |  47.074 ns |  44.033 ns |
| SimplePushStreamTest |  1000 |   3,576.24 ns |  11.202 ns |  10.478 ns |
| SimpleImperativeTest | 10000 |   5,996.41 ns |  27.344 ns |  25.578 ns |
|       SimpleLinqTest | 10000 | 134,129.03 ns | 443.042 ns | 414.422 ns |
| SimplePushStreamTest | 10000 |  34,988.54 ns | 126.441 ns | 118.273 ns |

The imperative "pipeline" does best, as expected, since there's no pipeline
overhead, but the push stream is doing a lot better than LINQ. Great, right?

Not so fast! If we add a second test that does the same thing in a slightly
different way, what happens?

  [<Benchmark>]
  member x.StructLinqStreamTest () =
    Enumerable.Range(0, x.Count)
      .Select(int64)
      .Select(fun v -> struct ((v &&& 1L) = 0L, v))
      .Select(fun struct (b, v) -> struct (b, v + 1L))
      .Select(fun struct (b, v) -> if b then v else 0L)
      .Sum()

  [<Benchmark>]
  member x.StructPushStreamTest () =
    PushStream.fromRange 0 x.Count
    |> PushStream.map     int64
    |> PushStream.map     (fun v -> struct ((v &&& 1L) = 0L, v))
    |> PushStream.map     (fun struct (b, v) -> struct (b, v + 1L))
    |> PushStream.map     (fun struct (b, v) -> if b then v else 0L)
    |> PushStream.sum

Suddenly the push stream performance is abysmal.

|               Method | Count |            Mean |        Error |       StdDev |
|--------------------- |------ |----------------:|-------------:|-------------:|
| StructLinqStreamTest |   100 |     3,060.52 ns |    15.855 ns |    14.831 ns |
| StructPushStreamTest |   100 |    60,241.31 ns |   256.031 ns |   239.492 ns |
| StructLinqStreamTest |  1000 |    27,813.86 ns |    99.601 ns |    93.167 ns |
| StructPushStreamTest |  1000 |   591,909.97 ns | 1,710.549 ns | 1,600.048 ns |
| StructLinqStreamTest | 10000 |   274,592.50 ns |   591.566 ns |   553.351 ns |
| StructPushStreamTest | 10000 | 5,908,867.08 ns | 9,791.398 ns | 9,158.880 ns |

We see that LINQ now performs about 20x better. What's going on?

Let's fix the problem using magic! Rewrite the push function from this:

// Original push function, just invokes the function r with v
let inline push       r v      = r v

Replace it with this nonsense:

// New invoke. Seems just to add redundant code that does nothing
let inline push      r v      = match r v with () -> ()

Now the push stream compares favorably in all cases:

|               Method | Count |          Mean |        Error |       StdDev |
|--------------------- |------ |--------------:|-------------:|-------------:|
| SimpleImperativeTest |   100 |      73.52 ns |     0.290 ns |     0.271 ns |
|       SimpleLinqTest |   100 |   1,552.31 ns |     4.984 ns |     4.418 ns |
| SimplePushStreamTest |   100 |     503.42 ns |     3.451 ns |     3.228 ns |
| StructLinqStreamTest |   100 |   3,079.51 ns |    18.761 ns |    17.549 ns |
| StructPushStreamTest |   100 |   2,531.61 ns |     9.913 ns |     9.273 ns |
| SimpleImperativeTest |  1000 |     612.66 ns |     1.832 ns |     1.713 ns |
|       SimpleLinqTest |  1000 |  13,667.12 ns |    41.531 ns |    38.848 ns |
| SimplePushStreamTest |  1000 |   4,346.26 ns |    16.795 ns |    15.710 ns |
| StructLinqStreamTest |  1000 |  28,029.09 ns |    70.510 ns |    58.879 ns |
| StructPushStreamTest |  1000 |  22,131.33 ns |    63.323 ns |    59.232 ns |
| SimpleImperativeTest | 10000 |   6,007.44 ns |    19.996 ns |    18.704 ns |
|       SimpleLinqTest | 10000 | 133,865.57 ns |   355.534 ns |   332.567 ns |
| SimplePushStreamTest | 10000 |  42,625.87 ns |    75.884 ns |    67.270 ns |
| StructLinqStreamTest | 10000 | 321,961.25 ns | 1,074.170 ns | 1,004.780 ns |
| StructPushStreamTest | 10000 | 242,181.19 ns |   419.269 ns |   392.185 ns |

What's going on?

The difference is the tail calls.

With the original push function the IL code to invoke the receiver function looks like this:

// let inline push      r v      = r v

// Tell the jitter that the call is a tail call
IL_000D: tail.
// Call invoke virtually
IL_000F: callvirt  instance !1 class <>::Invoke
// For a tail call the call function has to be followed by ret
IL_0014: ret

For the modified push function, which doesn't change the meaning of the program at all, the IL code looks like this:

// let inline push      r v      = match r v with () -> ()

// Call Invoke (no tail call)
callvirt  instance !1 class [FSharp.Core]Microsoft.FSharp.Core.FSharpFunc`2<int64, class [FSharp.Core]Microsoft.FSharp.Core.Unit>::Invoke(!0)
// Throw away the result (which is the unit value anyway)
IL_0012: pop
// Push the result (the unit value is null)
IL_0013: ldnull
// Done
IL_0014: ret

As the modified push "looks" at the value (the match) and then loads a new result, this doesn't fit the pattern of a tail call. Thus the compiler doesn't inject the .tail prefix.

Why do .tail calls sometimes go fast and sometimes really, really slow?

Let's look at the assembly code when .tail call executes quickly.

Fast .tail calls (Yay!)

// This is the implementation of: PushStream.map (fun v -> struct ((v &&& 1L) = 0L, v))

// What's this? Tiered compilation leftovers?
00007fff`90362010 0f1f440000      nop     dword ptr [rax+rax]
00007fff`90362015 8bc2            mov     eax,edx
// Check is number odd
00007fff`90362017 a801            test    al,1
00007fff`90362019 7512            jne     00007fff`9036202d
// Number is even, pass value down the pipeline
// This is a virtual call so we find the jump address by looking
//  up the value in the vtables
00007fff`9036201b 488b4908        mov     rcx,qword ptr [rcx+8]
00007fff`9036201f 488b01          mov     rax,qword ptr [rcx]
00007fff`90362022 488b4040        mov     rax,qword ptr [rax+40h]
00007fff`90362026 488b4020        mov     rax,qword ptr [rax+20h]
// rax now contains the address of the next step in the pipeline
//  thanks to .tail call we do a jmp here not call
//  this means that when the pipeline finally returns it will return
//  directly to the top loop
00007fff`9036202a 48ffe0          jmp     rax
// Number was odd, clear the result and return
00007fff`9036202d 33c0            xor     eax,eax
00007fff`9036202f c3              ret

So apart from the odd nop at the beginning and the usual vtable dance for
virtual functions (what are the conditions for the jitter to succeed with
devirtualization?) it's quite OK. Actually, surprisingly, just 5x slower than
the fast imperative solution, considering how much more junk happens in each step.

Let's look at the slow .tail call

Slow .tail calls (Boo!)

// This is the implementation of: PushStream.map (fun v -> struct ((v &&& 1L) = 0L, v))

// Function prelude
00007ff7`d77725e0 56              push    rsi
00007ff7`d77725e1 4883ec40        sub     rsp,40h
00007ff7`d77725e5 c5f877          vzeroupper
00007ff7`d77725e8 4c8d442430      lea     r8,[rsp+30h]
00007ff7`d77725ed c5f857c0        vxorps  xmm0,xmm0,xmm0
00007ff7`d77725f1 c4c17a7f00      vmovdqu xmmword ptr [r8],xmm0
00007ff7`d77725f6 488b7108        mov     rsi,qword ptr [rcx+8]
00007ff7`d77725fa 448bc2          mov     r8d,edx
// Is number odd?
00007ff7`d77725fd 41f6c001        test    r8b,1
00007ff7`d7772601 410f94c0        sete    r8b
00007ff7`d7772605 450fb6c0        movzx   r8d,r8b
// Save results
00007ff7`d7772609 4488442438      mov     byte ptr [rsp+38h],r8b
00007ff7`d777260e 4889542430      mov     qword ptr [rsp+30h],rdx
00007ff7`d7772613 49b8784b6637f87f0000 mov r8,offset coreclr!g_TrapReturningThreads (00007ff8`37664b78)
// Checks: Volatile<LONG>       g_TrapReturningThreads;
// If true we need to suspend the thread (by calling coreclr!JIT_PollGC ())
00007ff7`d777261d 41833800        cmp     dword ptr [r8],0
00007ff7`d7772621 752e            jne     00007ff7`d7772651
00007ff7`d7772623 4c8bc6          mov     r8,rsi
// Juggling with struct tuple values
00007ff7`d7772626 c5fa6f442430    vmovdqu xmm0,xmmword ptr [rsp+30h]
00007ff7`d777262c c5fa7f442420    vmovdqu xmmword ptr [rsp+20h],xmm0
00007ff7`d7772632 4c8d4c2420      lea     r9,[rsp+20h]
00007ff7`d7772637 48b990d162d7f77f0000 mov rcx,7FF7D762D190h
// Loading vtable to find the address to call to
00007ff7`d7772641 498b10          mov     rdx,qword ptr [r8]
00007ff7`d7772644 488b5240        mov     rdx,qword ptr [rdx+40h]
00007ff7`d7772648 488b5220        mov     rdx,qword ptr [rdx+20h]
// Do the call to next step through: coreclr!JIT_TailCall
00007ff7`d777264c e89ff6c35f      call    coreclr!JIT_TailCall (00007ff8`373b1cf0)
00007ff7`d7772651 e86ad6c35f      call    coreclr!JIT_PollGC (00007ff8`373afcc0)
00007ff7`d7772656 ebcb            jmp     00007ff7`d7772623

So, there's a lot more setup here, but that is because this function actually
needs a stack frame to store intermediate results. I also suspect that because
the stack frame is needed, it has to call coreclr!JIT_TailCall at the end, which
is the CPU hog.

I don't know exactly what coreclr!JIT_TailCall does, but it does a lot when
stepping through the assembly code. My suspicion is that its purpose is to
eliminate the stack frame and call the next function. While it does eliminate
the stack frame, it adds about 60x overhead to the PushStream pipeline.

Finally, let's look at the code for the modified push function that gets back
predictable performance.

Predictable calls

// Function prelude
00007ff7`d77725e1 4883ec40        sub     rsp,58h
00007ff7`d78625f4 c5f877          vzeroupper
00007ff7`d78625f7 33c0            xor     eax,eax
00007ff7`d78625f9 4889442448      mov     qword ptr [rsp+48h],rax
00007ff7`d78625fe 4889442450      mov     qword ptr [rsp+50h],rax
00007ff7`d7862603 488d442438      lea     rax,[rsp+38h]
00007ff7`d7862608 c5f857c0        vxorps  xmm0,xmm0,xmm0
00007ff7`d786260c c5fa7f00        vmovdqu xmmword ptr [rax],xmm0
00007ff7`d7862610 8bc2            mov     eax,edx
// Is number odd?
00007ff7`d7862612 a801            test    al,1
00007ff7`d7862614 0f94c0          sete    al
00007ff7`d7862617 0fb6c0          movzx   eax,al
// Juggling with struct tuple values
00007ff7`d786261a 88442440        mov     byte ptr [rsp+40h],al
00007ff7`d786261e 4889542438      mov     qword ptr [rsp+38h],rdx
00007ff7`d7862623 c5fa6f442438    vmovdqu xmm0,xmmword ptr [rsp+38h]
00007ff7`d7862629 c5fa7f442448    vmovdqu xmmword ptr [rsp+48h],xmm0
00007ff7`d786262f 488b4908        mov     rcx,qword ptr [rcx+8]
00007ff7`d7862633 c5fa6f442448    vmovdqu xmm0,xmmword ptr [rsp+48h]
00007ff7`d7862639 c5fa7f442428    vmovdqu xmmword ptr [rsp+28h],xmm0
00007ff7`d786263f 488d542428      lea     rdx,[rsp+28h]
// Loading vtable to find the address to call to
00007ff7`d7862644 488b01          mov     rax,qword ptr [rcx]
00007ff7`d7862647 488b4040        mov     rax,qword ptr [rax+40h]
// Call the next step in the pipeline
00007ff7`d786264b ff5020          call    qword ptr [rax+20h]
// The function returns, clear eax
00007ff7`d786264e 33c0            xor     eax,eax
// Deallocate stackframe
00007ff7`d7862650 4883c458        add     rsp,58h
// Return to previous chain
00007ff7`d7862654 c3              ret

Because juggling with a struct tuple is more complex than in the first example,
the code is more complex, but at least the next step in the pipeline is invoked
without the need for coreclr!JIT_TailCall, which means the performance is
reasonable and predictable.

It turns out that the only overhead of match r v with () -> () ends up being a
xor eax,eax, which is essentially free compared to everything else.

Frustration sets in...

I believe most of us only have to avoid writing code that has TRULY TERRIBLE
PERFORMANCE. We can get away with bad performance in 99% of all the code we write.

However, if I need to write performant code in F# (and maybe future versions of
C#), then because tail calls can be really, really slow, I have to be very
careful to write my code so that the F# compiler doesn't emit the .tail prefix.

Sure, I have a pattern that allows me to do that in this case, but what if the F# compiler improves in future releases and eliminates the nonsense code I wrote to avoid tail calls? The F# compiler has options that allow me to suppress tail calls, but F# also supports true inlining, which means that even if my library is compiled without tail calls, when its functions are inlined into a calling assembly the compiler might well inject tail calls there.

Further, tail calls sometimes do go faster, when no stack frame is needed.

This puts me in a frustrating spot; I want tail calls but I don't want to pay the
price of the worst case tail call performance.

Obviously the best solution would be for tail calls to always be faster than
normal calls, but I am sure that is tricky to implement (otherwise it would have been done already).

The second-best solution, for my particular scenario, would be: if the tail call
can be faster than a normal call, do it; otherwise fall back to a normal call.
That is probably confusing, as it then doesn't always have the correct semantics
of a tail call, but for this particular scenario it is what I want.

I realize F# is a small language, so I am hoping the C# compiler will start
emitting .tail prefixes so that the big C# community will notice the awkward
performance of tail calls.

Be sure to complain to Microsoft Language & Runtime Community standup on twitch
if you were bored by this rant and want to see less of it.

Regards.

Full sample code

// Turn off tiered compilation
// $env:COMPlus_TieredCompilation="0"

// dotnet core    : 3.1.100
// FSharp.Core    : 4.7.0
// BenchMarkDotNet: 0.12.0

module TailCall =
  open System
  open System.Linq
  open System.Diagnostics

  //    let inline push      r v      = match r v with () -> ()
  //    let inline push      r v      = r v

  // Minimalistic PushStream
  //  A PushStream accepts a receiver function that will be called
  //  with each value in the PushStream
  type 'T PushStream = ('T -> unit) -> unit

  module PushStream =
    let inline zero      ()       = LanguagePrimitives.GenericZero
    let inline push      r v      = r v

    // Creates a PushStream with all integers from b to e (inclusive)
    let inline fromRange b e    r = for i = b to e do push r i
    // Maps all values in ps using mapping function f
    let inline map       f   ps r = ps (fun v -> push r (f v))
    // Filters all values in ps using filter function f
    let inline filter    f   ps r = ps (fun v -> if f v then push r v)
    // Sums all values in ps
    let inline sum           ps   = let mutable s = zero () in ps (fun v -> s <- s + v); s

  module Tests =
    open BenchmarkDotNet.Attributes
    open BenchmarkDotNet.Running

    type Benchmarks () =
      [<Params (10000, 1000, 100)>] 
      member val public Count = 100 with get, set

      [<Benchmark>]
      member x.SimpleImperativeTest () =
        let mutable i   = x.Count
        let mutable sum = 0L
        while i >= 0 do
          let v =  int64 i
          i     <- i - 1
          if (v &&& 1L) = 0L then
            sum <- sum + (v + 1L)
        sum

      [<Benchmark>]
      member x.SimpleLinqTest () =
        Enumerable.Range(0, x.Count)
          .Select(int64)
          .Where(fun v -> (v &&& 1L) = 0L)
          .Select((+) 1L)
          .Sum()

      [<Benchmark>]
      member x.SimplePushStreamTest () =
        PushStream.fromRange  0 x.Count
        |> PushStream.map     int64
        |> PushStream.filter  (fun v -> (v &&& 1L) = 0L)
        |> PushStream.map     ((+) 1L)
        |> PushStream.sum

      [<Benchmark>]
      member x.StructLinqStreamTest () =
        Enumerable.Range(0, x.Count)
          .Select(int64)
          .Select(fun v -> struct ((v &&& 1L) = 0L, v))
          .Select(fun struct (b, v) -> struct (b, v + 1L))
          .Select(fun struct (b, v) -> if b then v else 0L)
          .Sum()

      [<Benchmark>]
      member x.StructPushStreamTest () =
        PushStream.fromRange 0 x.Count
        |> PushStream.map     int64
        |> PushStream.map     (fun v -> struct ((v &&& 1L) = 0L, v))
        |> PushStream.map     (fun struct (b, v) -> struct (b, v + 1L))
        |> PushStream.map     (fun struct (b, v) -> if b then v else 0L)
        |> PushStream.sum

    let now =
      let sw = Stopwatch ()
      sw.Start ()
      fun () -> sw.ElapsedMilliseconds

    let time o a =
      let inline cc n = GC.CollectionCount n

      let v = a ()

      GC.Collect (2, GCCollectionMode.Forced)
      GC.WaitForFullGCComplete () |> ignore

      let bcc0, bcc1, bcc2 = cc 0, cc 1, cc 2

      let before = now ()

      for _ = 1 to o do
        a () |> ignore

      let after = now ()

      let acc0, acc1, acc2 = cc 0, cc 1, cc 2

      v, (after - before), (acc0 - bcc0, acc1 - bcc1, acc2 - bcc2)

    let run argv = 
      let b = BenchmarkSwitcher [|typeof<Benchmarks>|]
      let summary = b.Run argv
      printfn "%A" summary

    // BenchmarkDotNet is good, but the runs take too long for me for experimentation
    //  Then I rely on quickRun
    let quickRun () = 

      let inline testCase n a = n, fun c -> string (a c)

      let benchmarks = Benchmarks ()

      let testCases = 
        [|
          testCase  "simple, imperative"  benchmarks.SimpleImperativeTest
          testCase  "simple, linq"        benchmarks.SimpleLinqTest
          testCase  "simple, pushstream"  benchmarks.SimplePushStreamTest
//          testCase  "struct, linq"        benchmarks.StructLinqStreamTest
//          testCase  "struct, pushstream"  benchmarks.StructPushStreamTest
        |]

      let total   = 100000000
      let inners  = [|1000000; 10000; 100|]
      for inner in inners do
        let outer = total / inner

        benchmarks.Count <- inner

        printfn "Performance test, total: %d, outer: %d, inner: %d" total outer inner

        for n, a in testCases do
          printfn "  Running '%s'..." n
          let v, r, cc = time outer a
          printfn "    result is %A, it took %d ms to produce with (%A) CC" v r cc

[<EntryPoint>]
let main argv =
//  TailCall.Tests.quickRun ()
  TailCall.Tests.run argv
  0

category:cq
theme:tail-call
skill-level:expert
cost:medium

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Jan 25, 2020
@jkotas jkotas added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 25, 2020
@AndyAyersMS
Member

@mrange thanks for the feedback.

We have been looking at tail calls recently and have a big change in the works that has not yet landed. It would be good to try these examples with #341 to see if it addresses your concerns.

cc @jakobbotsch @erozenfeld

@mrange
Author

mrange commented Jan 26, 2020

Thanks, it sounds great that you are working on tail calls. The PR mentions that there is no need to change the x86 tail call because it is already fast. From my experience, x86 tail calls might not be as bad as x64, but IMHO they're not fast. I find that x86 tail calls make the simple push-stream pipeline take 3.6 sec in my quick benchmark vs 0.6 sec for x86 non-tail. With that overhead I would still try to avoid the overhead of tail calls in the push stream example.

@jakobbotsch
Member

It has been a while since I looked at x86 but my impression when I did was that there was not much that could be done to make the general case faster. This is because the calling convention in x86 means that we have to adjust stack pointers and move arguments and the return pointer. There is no easy workaround to make this faster; it is slower simply because it has more stuff to do.

The state of tailcalls is that they are not always an optimization, so compilers will simply not do the transformation when that is the case. But F# is very aggressive at emitting tail. because true tailcalls are sometimes required for correct functional programs. Omitting the prefix simply makes the code incorrect, so that is not an option for the compiler. In your particular case, the program with the match might be faster, but from the perspective of the F# compiler it is an incorrect transformation. And from the perspective of the runtime, the F# compiler has told it that it needs to do a tailcall, so the runtime has to obey, and unfortunately it needs to jump through hoops to do so in some cases; there simply is more work to do.

The work in #341 will make x64 quite a bit faster and add support for tailcalls for other platforms, but it will still be slower than not having tailcalls in the general case. However it will make programs relying on tail. correct.

So to sum up I do not think it is possible to make tail. calls always faster than normal calls. The best you can do is expose some way in F# to let the programmer tell the compiler that a tailcall is not required for a call site. But otherwise you are choosing between fast programs (without tail.) and correct programs (with tail.), and overhead in some cases is to be expected.
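A small example (mine, not from this thread) of the correctness point: mutual recursion cannot be compiled into a self-loop, so the prefix is not optional there.

```fsharp
// Without the tail. prefix, a large argument here eventually throws
// StackOverflowException; with it, the pair runs in constant stack space,
// which is why the F# compiler cannot simply drop the prefix.
let rec isEven (n : int) : bool =
    if n = 0 then true else isOdd (n - 1)
and isOdd (n : int) : bool =
    if n = 0 then false else isEven (n - 1)

// e.g. isEven 10000000 relies on tail calls to terminate safely
```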

@mrange
Author

mrange commented Jan 26, 2020

I understand that from the jitter's perspective it has to implement tail. correctly, but as for correctness, isn't the problem just preventing long tail-call chains from running out of stack space? Or are there other differences I am not aware of?

I guess I should register an issue on the F# compiler to allow us to suppress .tail prefixes reliably.
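A sketch of such a suppression helper (hypothetical name; the same trick as the modified push above):

```fsharp
// Matching on the unit result keeps the wrapped call out of tail position,
// so the compiler emits a normal call instead of a tail. call.
let inline notail (f : 'T -> unit) (v : 'T) : unit =
    match f v with () -> ()

// e.g. in the push stream:  let inline push r v = notail r v
// ...though nothing guarantees future compiler versions won't see through it.
```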

@jaykrell
Contributor

It is a subtle area, with much misunderstanding and confusion.

I guess, I think, there should be two tail call operators:

  1. A hint to the JIT: hey, please try a little; you probably can figure it out, and I don't mind the debugging impact.

  2. A demand to the JIT: hey, please try super duper hard, even if the result is slow.

Presently there is only #2, which I believe F# outputs aggressively and C# outputs never.
The JIT does usually figure things out, and #1 cannot be trusted, e.g. for passing addresses of locals, so #1 is maybe moot. Maybe F# should be less aggressive with #2 and maybe C# should offer #2 as an option, but it's not clear.

Maybe the effort is better spent on the compiler frontends noticing near-tailcall opportunities and advising the developer to change their code a little. I realize these compilers are not that sophisticated.

Maybe the JIT could issue such warnings, but it feels like the wrong time.

Even C++ compilers don't do this, but maybe they should.

Also tailcall serves two purposes, from the two ideas above.

  1. Go a little faster, opportunistically.

  2. Conserve stack at all cost, process unbounded input with bounded stack use.

@AndyAyersMS
Member

Maybe the JIT could issue such warnings, but it feels like the wrong time.

Right, it is not easy for the jit to give feedback to the developer. There are tools like the Inlining Analyzer that show some promise here. I am not sure how hard it would be to pull something like this off in F#.

The jit relays tail call outcomes back to the runtime, and these end up as ETL records if you enable the jit tracing keyword (say, by checking the normally unchecked "Jit Inlining" checkbox in perfview). There isn't a nice display for this that I know of; perhaps in the short run we can get something added in perfview or a tail call diagnoser added to BenchmarkDotNet.


@mrange
Author

mrange commented Jan 27, 2020

The issue I see for a LINQ-like library that relies on tail calls is that users of the library will sometimes get really good performance and sometimes really bad performance. They might not be the kind of users who check ETL records for what is wrong, and even if they do, they can't really do much to eliminate the tail calls, as those are part of the library.

This means that as a LINQ-like library developer I need to protect my users from the bad outliers, and I have to remove the tail calls somehow.

However, as a library developer the ETL records are interesting; thanks for sharing.

It's perhaps too much to ask, but could you enlighten me a bit on the complexities of making a tail call, so I learn to appreciate it? :)

@jaykrell
Contributor

jaykrell commented Jan 27, 2020

@mrange,

The basic reason some things go slow, is that
there are "easy" tailcalls and "difficult" ones.

Roughly speaking, an "easy" one has a key characteristic:

  • the number of outgoing params is equal to or less
    than the incoming params. Or rather, their stack space.
int f1(int a)
{
  return f2(a); // easy
}

int f1(int a)
{
  return f2(a,1,2,4,5,6,7); // difficult,
   // depending on ABI details (how many parameters
   // can be passed in registers).
}

Before working in this area, I would have deemed "difficult"
to be "impossible", but it is not. Desktop CLR has been doing it for a while, when F# emits .tail.

There are also criteria, like passing the address of a local:

int f1(int a)
{
  int b;
  return f2(a, ref b); // difficult or impossible
}

And then, depending on ABI details, passing value types by value:

int f1(int a)
{
  struct B b;
  return f2(b); // depends on ABI
}

x86 and ARM pass structs by value "inline"; ARM64 and AMD64 pass the address
of a local. So for x86/ARM this is easy if b is smaller than or the same size
as a (generalized to the stack space consumed by f1's parameters).

As to how the difficult/slow ones work, see here, it is complicated:

https://github.com/dotnet/runtime/blob/master/docs/design/features/tailcalls-with-helpers.md
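The rough idea behind the helper-based mechanism can be sketched as a trampoline (my simplification; the real runtime mechanism described in the document above is far more involved):

```fsharp
// Instead of calling the next function directly (growing the stack), each
// step returns "what to do next" to a dispatcher loop, so the chain runs
// in constant stack space no matter how long it is.
type Next =
    | Done of int
    | Call of (unit -> Next)

let dispatch (start : unit -> Next) : int =
    let mutable next    = start ()
    let mutable result  = 0
    let mutable running = true
    while running do
        match next with
        | Done v -> result <- v; running <- false
        | Call f -> next <- f ()
    result

// A chain that would otherwise need a frame per step:
let rec countDown n () =
    if n = 0 then Done 0 else Call (countDown (n - 1))

// dispatch (countDown 1000000) completes without growing the stack
```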

As to how the fast/easy ones work, it is really easy.
The function essentially tears down its frame, placing parameters
before or after the teardown, and jmps instead of calls to the target.

There are other restrictions, like being unable to tailcall within try/finally
or try/catch: the calling function must remain on the stack.

There is essentially no downside to the easy/fast cases.
Some people complain about lost frames in call stacks in the debugger.

I have seen code that is broken by that, because it does runtime stack walks, but it is bad and rare code imho.
See:
mono/mono#16928

  1. Code is using the name of its caller to form logging strings.

  2. There is a public API that depends on walking to its caller in a static class constructor.
    See: https://github.com/mono/mono/pull/9620/files#diff-9ce66b9dcdea1fbc95cc16fa3bca1954R53
    and https://github.com/mono/mono/pull/9620/files#diff-e7526071878d708680e4a8c5fa5a46edR6232

@AndyAyersMS
Member

Ah, there already is a tail call diagnoser for BenchmarkDotNet, though I can't figure out how to get it to compile on this example.

cc @adamsitnik

@timba

timba commented Jan 31, 2020

In the full framework (not Core) the JIT could detect tail recursion in IL and generate iteration instead of recursion even without a .tail opcode in the IL, but only for x64 Release builds: https://github.com/timba/tailrecursiondotnet

Is this still the behaviour for the Core JIT? And why only x64 Release?

@jkotas
Member

jkotas commented Jan 31, 2020

And why only Release?

It is an optimization. The JIT does 99% of optimizations in Release only.
It makes debugging experience worse. The JIT avoids optimizations that make debugging experience worse in Debug.

And why only x64?

For compatibility reasons, the 32-bit JIT for .NET Framework is a very different codebase from the 64-bit JIT.

@adamsitnik
Member

I can't figure out how to get it to compile on this example

I got it running and shared it here: https://github.com/adamsitnik/TryNewDisassembler/tree/fsharpTailCalls

git clone https://github.com/adamsitnik/TryNewDisassembler
cd TryNewDisassembler
git checkout fsharpTailCalls
dotnet run -c Release -f netcoreapp3.0 --filter '*' --runtimes net461 netcoreapp3.0

@mrange
Author

mrange commented Jan 31, 2020

Just FYI: I tried the GitHub project above and it reproduces the performance numbers I have seen. I learnt a bit about BenchmarkDotNet as well from the code :)

One interesting thing to note, unrelated to the issue, is that the push stream is 20% slower on .NET Core than on .NET Framework 🤔

There is essentially no downside to the easy/fast cases.
Some people complain about lost frames in call stacks in debugger.

I am personally totally ok losing frames, that's kind of the point IMHO :).

Anyway, I am out of my league here, but should/can the jitter be more aggressive about inlining tail calls, given the cost of the complex tail calls?

@AndyAyersMS
Member

@adamsitnik Thanks -- did not realize it required an extra package.

When I run this the results look odd... more like it is looking at inlining events than tail call events.

// * Diagnostic Output - TailCallDiagnoser *
--------------------

--------------------
Benchmarks.SimpleImperativeTest: DefaultJob [Count=100]
--------------------

--------------------
Benchmarks.SimpleLinqTest: DefaultJob [Count=100]
--------------------
Caller: Program+TailCall+Tests+Benchmarks.SimpleLinqTest - instance int64  ()
Callee: System.Linq.Enumerable.Select - generic class System.Collections.Generic.IEnumerable`1<!!1>  (class System.Collections.Generic.IEnumerable`1<!!0>,class System.Func`2<!!0,!!1>)
Fail Reason: too many il bytes
--------------------
Caller: Program+TailCall+Tests+Benchmarks.SimpleLinqTest - instance int64  ()
Callee: System.Linq.Enumerable.Select - generic class System.Collections.Generic.IEnumerable`1<!!1>  (class System.Collections.Generic.IEnumerable`1<!!0>,class System.Func`2<!!0,!!1>)
Fail Reason: too many il bytes
--------------------
Caller: Program+TailCall+Tests+Benchmarks.SimpleLinqTest - instance int64  ()
Callee: System.Linq.Enumerable.Sum - int64  (class System.Collections.Generic.IEnumerable`1<int64>)
Fail Reason: has exception handling
--------------------
Caller: Program+TailCall+Tests+Benchmarks.SimpleLinqTest - instance int64  ()
Callee: System.Linq.Enumerable.Range - class System.Collections.Generic.IEnumerable`1<int32>  (int32,int32)
Fail Reason: unprofitable inline
--------------------
Caller: Program+TailCall+Tests+Benchmarks.SimpleLinqTest - instance int64  ()
Callee: System.MulticastDelegate.ThrowNullThisInDelegateToInstance - instance void  ()
Fail Reason: does not return
--------------------
Caller: Program+TailCall+Tests+Benchmarks.SimpleLinqTest - instance int64  ()
Callee: System.MulticastDelegate.ThrowNullThisInDelegateToInstance - instance void  ()
Fail Reason: does not return
--------------------
Caller: Program+TailCall+Tests+Benchmarks.SimpleLinqTest - instance int64  ()
Callee: System.Linq.Enumerable.Where - generic class System.Collections.Generic.IEnumerable`1<!!0>  (class System.Collections.Generic.IEnumerable`1<!!0>,class System.Func`2<!!0,bool>)
Fail Reason: unprofitable inline
--------------------
Caller: Program+TailCall+Tests+Benchmarks.SimpleLinqTest - instance int64  ()
Callee: System.MulticastDelegate.ThrowNullThisInDelegateToInstance - instance void  ()
Fail Reason: does not return
--------------------

@adamsitnik
Member

When I run this the results look odd... more like it is looking at inlining events than tail call events.

It has surprised me as well, to the point that I double-checked the source code to make sure that we sign up for the right events. We do:

https://github.com/dotnet/BenchmarkDotNet/blob/5e7f01ddb91f179916661cbc3885052238c27390/src/BenchmarkDotNet.Diagnostics.Windows/TailCallDiagnoser.cs#L44-L83

I am open to any suggestions that could improve the output.

@jakobbotsch
Member

When I run this the results look odd... more like it is looking at inlining events than tail call events.

I think I ran into this problem before, and I believe it was an EventPipe issue. The events were right back then if you used the in-process event source (like in PMI).
I thought the EventPipe issue was fixed by @jorive months ago, however, so this may be something else. Unfortunately I cannot find the original issue...

@AndyAyersMS
Member

should/can the jitter be more aggressive on inlining tail calls

I think we can be more aggressive, but we also need to be careful; see notes in #12304.

@AndyAyersMS
Member

I am open to any suggestions that could improve the output.

Aside from fixing whatever bug is reporting wrong-looking data, it would be good to surface the tail call type and tail prefix for successful tail calls; the "interesting" case is a successful call with an explicit tail prefix where the call needed a helper.

@adamsitnik
Member

adamsitnik commented Feb 19, 2020

Aside from fixing whatever bug is reporting wrong-looking data

It looks like this issue is fixed in a newer version of TraceEvent. I've sent a PR to BenchmarkDotNet to update the dependency (dotnet/BenchmarkDotNet#1376) and also to improve its performance (dotnet/BenchmarkDotNet#1375). I've also updated the sample app: adamsitnik/TryNewDisassembler@d94211b

it would be good to surface the tail call type and tail prefix for successful tail calls

this is already possible, it's configurable via attribute optional parameter:

[<TailCallDiagnoser(logFailuresOnly = false, filterByNamespace = false)>]

@AndyAyersMS in the meantime you can use the following app.

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>netcoreapp2.1;netcoreapp2.2;netcoreapp3.0</TargetFrameworks>
  </PropertyGroup>
  <ItemGroup>
    <Compile Include="Program.fs" />
  </ItemGroup>
  <ItemGroup>
    <PackageReference Include="Microsoft.Diagnostics.Tracing.TraceEvent" Version="2.0.49" />
  </ItemGroup>
</Project>
module TailCall =
  open System.Linq

  //    let inline push      r v      = match r v with () -> ()
  //    let inline push      r v      = r v

  // Minimalistic PushStream
  //  A PushStream accepts a receiver function that will be called
  //  with each value in the PushStream
  type 'T PushStream = ('T -> unit) -> unit

  module PushStream =
    let inline zero      ()       = LanguagePrimitives.GenericZero
    let inline push      r v      = r v

    // Creates a PushStream with all integers from b to e (inclusive)
    let inline fromRange b e    r = for i = b to e do push r i
    // Maps all values in ps using mapping function f
    let inline map       f   ps r = ps (fun v -> push r (f v))
    // Filters all values in ps using filter function f
    let inline filter    f   ps r = ps (fun v -> if f v then push r v)
    // Sums all values in ps
    let inline sum           ps   = let mutable s = zero () in ps (fun v -> s <- s + v); s

  module Tests =
    open System
    type Benchmarks () =
      member val public Count = 100 with get, set

      member x.SimpleImperativeTest () =
        let mutable i   = x.Count
        let mutable sum = 0L
        while i >= 0 do
          let v =  int64 i
          i     <- i - 1
          if (v &&& 1L) = 0L then
            sum <- sum + (v + 1L)
        sum

      member x.SimpleLinqTest () =
        Enumerable.Range(0, x.Count)
          .Select(int64)
          .Where(fun v -> (v &&& 1L) = 0L)
          .Select((+) 1L)
          .Sum()

      member x.SimplePushStreamTest () =
        PushStream.fromRange  0 x.Count
        |> PushStream.map     int64
        |> PushStream.filter  (fun v -> (v &&& 1L) = 0L)
        |> PushStream.map     ((+) 1L)
        |> PushStream.sum

      member x.StructLinqStreamTest () =
        Enumerable.Range(0, x.Count)
          .Select(int64)
          .Select(fun v -> struct ((v &&& 1L) = 0L, v))
          .Select(fun struct (b, v) -> struct (b, v + 1L))
          .Select(fun struct (b, v) -> if b then v else 0L)
          .Sum()

      member x.StructPushStreamTest () =
        PushStream.fromRange 0 x.Count
        |> PushStream.map     int64
        |> PushStream.map     (fun v -> struct ((v &&& 1L) = 0L, v))
        |> PushStream.map     (fun struct (b, v) -> struct (b, v + 1L))
        |> PushStream.map     (fun struct (b, v) -> if b then v else 0L)
        |> PushStream.sum

    type ReflectionHelper () =
      member x.UseReflectionToInvokeMethods (methodName : string) =
        let typeInfo = typeof<ReflectionHelper>
        let benchmarkType = typeInfo.Assembly.GetType(typeInfo.FullName.Replace("ReflectionHelper", "Benchmarks"))
        let methodInfo = benchmarkType.GetMethod(methodName)   // 'method' is a reserved keyword in F#
        let instance = Activator.CreateInstance(benchmarkType)
        methodInfo.Invoke(instance, null)


open TailCall.Tests
open System
open System.Diagnostics
open System.Threading
open System.Threading.Tasks
open Microsoft.Diagnostics.Tracing.Session
open Microsoft.Diagnostics.Tracing.Parsers
open Microsoft.Diagnostics.Tracing.Parsers.Clr
open Microsoft.Diagnostics.Tracing

[<EntryPoint>]
let main argv =
    let currentProcessId = Process.GetCurrentProcess().Id
    use session = new TraceEventSession("tailCallsSession")
    session.EnableProvider(ClrTraceEventParser.ProviderGuid, TraceEventLevel.Verbose, uint64 ClrTraceEventParser.Keywords.JitTracing, options = null) |> ignore

    let onMethodTailCallSucceeded = System.Action<MethodJitTailCallSucceededTraceData>(fun e ->
        if e.ProcessID = currentProcessId then printfn "OK, callee: %s compiled: %s TailPrefix: %b TailCallType: %A" e.CalleeName e.MethodBeingCompiledName e.TailPrefix e.TailCallType
    )

    let onMethodTailCallFailed = System.Action<MethodJitTailCallFailedTraceData>(fun e ->
        if e.ProcessID = currentProcessId then printfn "NOT, callee: %s compiled: %s FailReason: %A" e.CalleeName e.MethodBeingCompiledName e.FailReason
    )

    let onMethodTailCallFailedAnsi = System.Action<MethodJitTailCallFailedAnsiTraceData>(fun e ->
        if e.ProcessID = currentProcessId then printfn "NOTa, callee: %s compiled: %s FailReason: %A" e.CalleeName e.MethodBeingCompiledName e.FailReason
    )

    session.Source.Clr.add_MethodTailCallSucceeded onMethodTailCallSucceeded
    session.Source.Clr.add_MethodTailCallFailed onMethodTailCallFailed
    session.Source.Clr.add_MethodTailCallFailedAnsi onMethodTailCallFailedAnsi

    // we use reflection helper to make sure that the "Benchmarks" type is not JITed before the tracing has started
    let sut = ReflectionHelper
    let result1 = sut().UseReflectionToInvokeMethods("SimpleImperativeTest")
    let result2 = sut().UseReflectionToInvokeMethods("SimpleLinqTest")
    let result3 = sut().UseReflectionToInvokeMethods("SimplePushStreamTest")
    let result4 = sut().UseReflectionToInvokeMethods("StructLinqStreamTest")
    let result5 = sut().UseReflectionToInvokeMethods("StructPushStreamTest")
    printfn "%A %A %A %A %A" result1 result2 result3 result4 result5

    let action = System.Action(fun _ -> session.Source.Process() |> ignore)
    let task = Task.Factory.StartNew(action, TaskCreationOptions.LongRunning)

    // give it some time & wait for delayed events
    let timeToSleep = TimeSpan.FromSeconds(10.0)
    Thread.Sleep(timeToSleep)

    0 // return an integer exit code

@AndyAyersMS
Member

Thanks @adamsitnik -- works well. Had to add in the namespace part to make sense of things, though:

caller: Diag+TailCall+Tests+StructPushStreamTest@65-4.Invoke 
callee: Microsoft.FSharp.Core.FSharpFunc`2[System.ValueTuple`2[System.Boolean,System.Int64],System.__Canon].Invoke 
TailPrefix: true 
TailCallType: HelperAssistedTailCall

At any rate, we should try running the code here with the tail call changes from #341. Will put it on my todo list but may not get to it for a while.

@erozenfeld
Member

I ran the benchmark with the changes from #341. That change speeds up StructPushStreamTest 7-8 times. Here are the numbers (before is current, after is with #341):

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.1.20155.7
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.12005, CoreFX 5.0.20.12005), X64 RyuJIT DEBUG
  Job-CBUONO : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
  Job-KATQIF : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

|               Method | Toolchain | Count |           Mean |        Error |       StdDev | 
|--------------------- |---------- |------ |---------------:|-------------:|-------------:|
| SimpleImperativeTest | after     |   100 |       106.3 ns |      1.04 ns |      0.98 ns |
| SimpleImperativeTest | before    |   100 |       108.6 ns |      1.21 ns |      1.13 ns |
|                      |           |       |                |              |              |
|       SimpleLinqTest | after     |   100 |     1,207.8 ns |     16.77 ns |     15.68 ns |
|       SimpleLinqTest | before    |   100 |     1,279.2 ns |     21.44 ns |     19.01 ns |
|                      |           |       |                |              |              |
| SimplePushStreamTest | after     |   100 |       360.7 ns |      3.60 ns |      3.37 ns |
| SimplePushStreamTest | before    |   100 |       382.6 ns |      7.44 ns |      6.96 ns |
|                      |           |       |                |              |              |
| StructLinqStreamTest | after     |   100 |     2,574.6 ns |     49.93 ns |     46.71 ns |
| StructLinqStreamTest | before    |   100 |     2,689.9 ns |     48.13 ns |     42.66 ns |
|                      |           |       |                |              |              |
| StructPushStreamTest | after     |   100 |     7,054.0 ns |    139.15 ns |    175.98 ns |
| StructPushStreamTest | before    |   100 |    53,018.4 ns |  1,107.14 ns |  2,838.03 ns |
|                      |           |       |                |              |              |
| SimpleImperativeTest | after     |  1000 |       575.7 ns |     11.92 ns |     16.72 ns |
| SimpleImperativeTest | before    |  1000 |       962.8 ns |     19.18 ns |     44.82 ns |
|                      |           |       |                |              |              |
|       SimpleLinqTest | after     |  1000 |    11,382.0 ns |    244.17 ns |    325.96 ns |
|       SimpleLinqTest | before    |  1000 |    11,818.7 ns |    231.53 ns |    216.57 ns |
|                      |           |       |                |              |              |
| SimplePushStreamTest | after     |  1000 |     3,215.6 ns |     63.85 ns |     87.40 ns |
| SimplePushStreamTest | before    |  1000 |     3,343.1 ns |     65.32 ns |     80.21 ns |
|                      |           |       |                |              |              |
| StructLinqStreamTest | after     |  1000 |    25,345.5 ns |    380.99 ns |    356.38 ns |
| StructLinqStreamTest | before    |  1000 |    25,673.0 ns |    502.46 ns |    670.78 ns |
|                      |           |       |                |              |              |
| StructPushStreamTest | after     |  1000 |    72,993.9 ns |  1,481.72 ns |  4,251.33 ns |
| StructPushStreamTest | before    |  1000 |   517,403.6 ns | 10,017.01 ns | 11,133.88 ns |
|                      |           |       |                |              |              |
| SimpleImperativeTest | after     | 10000 |     5,648.8 ns |     94.17 ns |     88.09 ns |
| SimpleImperativeTest | before    | 10000 |    10,365.2 ns |    159.36 ns |    149.06 ns |
|                      |           |       |                |              |              |
|       SimpleLinqTest | after     | 10000 |   108,323.9 ns |  1,096.84 ns |    915.91 ns |
|       SimpleLinqTest | before    | 10000 |   109,242.2 ns |  2,153.00 ns |  3,087.77 ns |
|                      |           |       |                |              |              |
| SimplePushStreamTest | after     | 10000 |    32,878.2 ns |    632.49 ns |    621.19 ns |
| SimplePushStreamTest | before    | 10000 |    33,403.5 ns |    660.98 ns |    926.60 ns |
|                      |           |       |                |              |              |
| StructLinqStreamTest | after     | 10000 |   234,969.1 ns |  1,626.21 ns |  1,521.15 ns |
| StructLinqStreamTest | before    | 10000 |   242,740.5 ns |  2,549.54 ns |  2,260.10 ns |
|                      |           |       |                |              |              |
| StructPushStreamTest | after     | 10000 |   610,634.5 ns | 12,163.64 ns | 29,837.62 ns |
| StructPushStreamTest | before    | 10000 | 4,946,534.6 ns | 96,406.58 ns | 94,684.16 ns |

I'm working with @jakobbotsch on getting #341 merged soon.

@BruceForstall
Member

Based on my reading of this issue, #341 will improve the situation a great deal, especially for F#. I don't see any other reason to keep this issue open, so I'm closing it.

@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label Apr 4, 2020
@BruceForstall BruceForstall added this to the 5.0 milestone Apr 4, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
10 participants