Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RyuJIT: Intrinsify GetType().TypeHandle.Value and typeof(T).TypeHandle.Value #5973

Closed
omariom opened this issue May 31, 2016 · 16 comments
Closed
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Milestone

Comments

@omariom
Copy link
Contributor

omariom commented May 31, 2016

Sometimes I use IntPtr instead of Type as a key (or part of a complex key) in Dictionary.
Advantages is that it is a value type so I don't keep full blown Type object alive, and GC likes value type fields more than ref ones because less references to follow.

But each time I need to get the value or check if a key exists I have to get Type just to retrieve its handle value.

I suggest intrinsifying these calls:

[obj.]GetType().TypeHandle.Value
typeof(T).TypeHandle.Value

How

In the following code JIT just compares method table pointer in the object header to the address of method table of String type.

[MethodImpl(MethodImplOptions.NoInlining)]
private static bool TestMtCheck(object obj)
{
    return obj.GetType() == typeof(String);
}
 mov rax,7FEF61BDA88h  
 cmp qword ptr [rcx],rax
 sete al  
 movzx eax,al  
 ret  

It could do the same when getting type handle value :

[MethodImpl(MethodImplOptions.NoInlining)]
private static IntPtr TestGetHandle(object obj)
{
    return obj.GetType().TypeHandle.Value;
}

MethodImpl(MethodImplOptions.NoInlining)]
private static IntPtr TestGetHandle()
{
    return typeof(String).TypeHandle.Value;
}
mov rax, qword ptr [rcx]
ret
mov rax, 7FEF61BDA88h  
ret

category:cq
theme:type-intrinsics
skill-level:expert
cost:medium

@sergiy-k
Copy link
Contributor

/cc @RussKeldorph

@RussKeldorph
Copy link
Contributor

@jkotas Would JIT be involved in this one?

@jkotas
Copy link
Member

jkotas commented May 31, 2016

Yes, this would be JIT work.

@jkotas
Copy link
Member

jkotas commented May 31, 2016

There are complications with array type handles and proxy type handles (for remoting on full framework) - the type handle is not equal to methodtable for these.

It may be better to implement fixups for System.Type instead. It would avoid helper call to get to the System.Type. This optimization would be applicable more broadly, and it would make access to the raw handle value cheaper too. I have conversion about it with @cmckinsey @geoffkizer recently.

@AndyAyersMS
Copy link
Member

@jkotas did you ever do any writeup for the fixup idea?

@jkotas
Copy link
Member

jkotas commented Apr 24, 2018

Today, ldtoken is expanded to embedGenericHandle + CORINFO_HELP_TYPEHANDLE_TO_RUNTIMETYPE.

We can add a new method to JIT/EE interface bool expandLdToken(CORINFO_RESOLVED_TOKEN * pResolvedToken, CORINFO_GENERICHANDLE_RESULT * pResult) that would allow short-circuiting this if possible. ldtoken would be the lookup returned by bool expandLdToken(CORINFO_RESOLVED_TOKEN * pResolvedToken, CORINFO_GENERICHANDLE_RESULT * pResult) and we would save CORINFO_HELP_TYPEHANDLE_TO_RUNTIMETYPE call.

@adamsitnik
Copy link
Member

Yesterday I've been working on improving the performance of GC.AllocateUninitializedArray to be able to use it effectively in StringBuilder.

typeof(T[]).TypeHandle.Value (mind the array of T, not just T) used in the following line:

https://github.com/dotnet/coreclr/blob/8bc7fab14d78030914e98b33c70b370d513021f6/src/System.Private.CoreLib/src/System/GC.cs#L679

showed up in the profiles:

obraz

It was just a few %, which itself is not bad. The problem was that this generic code is executed only for value types and having this extra call(s) (.GetRuntimeType and Handle.get_Value) increases the method size and hence prevents from inlining.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@msftgits msftgits added this to the Future milestone Jan 30, 2020
@VSadov
Copy link
Member

VSadov commented Mar 12, 2020

I had two cases recently where I could use this. (managed StelemRef and GC.Allocate[Uninitialized]Array API)

I believe the issues with array type handles should not be a problem any more.

@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 25, 2020
@MichalStrehovsky
Copy link
Member

I believe the issues with array type handles should not be a problem any more.

Would this still be a problem for pointers and potentially function pointers? (typeof(void*).TypeHandle.Value). AFAIK those don't have a MethodTable either.

I think we could potentially do something cheap for our own internal purposes within CoreLib if instead of looking at the public API surface (RuntimeTypeHandle.get_Value) that might have problems with pointer types, we do this for RuntimeTypeHandle.GetValueInternal. That one is already a RyuJIT intrinsic.

The intrinsic handling of CORINFO_INTRINSIC_RTH_GetValueInternal in RyuJIT requires the tree to have a CORINFO_HELP_TYPEHANDLE_TO_RUNTIMETYPEHANDLE helper call as the this. We currently have CORINFO_HELP_TYPEHANDLE_TO_RUNTIMETYPE followed by a get_TypeHandle call. I think all we need is to recognize that CORINFO_HELP_TYPEHANDLE_TO_RUNTIMETYPE followed by a get_TypeHandle call should be folded into a single CORINFO_HELP_TYPEHANDLE_TO_RUNTIMETYPEHANDLE call and the GetValueInternal intrinsic would kick in.

Would that work?

@jkotas
Copy link
Member

jkotas commented Jan 20, 2021

Would this still be a problem for pointers and potentially function pointers?

I do not think so. The issues with these are about boxing. There is no boxing involved here.

RuntimeTypeHandle.GetValueInternal

This intrisic seems to be doing the exact same thing as RuntimeTypeHandle.Value these days. Can we delete it, and turn the public Value property into intrinsic instead? Then the rest of your suggestion should work nicely for any code our there.

@MichalStrehovsky
Copy link
Member

This intrisic seems to be doing the exact same thing as RuntimeTypeHandle.Value these days. Can we delete it, and turn the public Value property into intrinsic instead?

I looked into it but kind of got lost in it. get_Value is an instance method on a valuetype and those get called quite a bit differently from static methods (GetValueInternal is a static method). That's quite a bit more pattern matching.

IL_0000:  ldtoken    [System.Runtime]System.Object
  IL_0005:  call       class [System.Runtime]System.Type [System.Runtime]System.Type::GetTypeFromHandle(valuetype [System.Runtime]System.RuntimeTypeHandle)
  IL_000a:  callvirt   instance valuetype [System.Runtime]System.RuntimeTypeHandle [System.Runtime]System.Type::get_TypeHandle()
  IL_000f:  stloc.0
  IL_0010:  ldloca.s   V_0
  IL_0012:  call       instance native int [System.Runtime]System.RuntimeTypeHandle::get_Value()

Optimizing GetValueInternal is easy because we just remove a call. Optimizing get_Value requires finding the store for the ldloc, eliminating that, along with ldloca, and the call.

GetValueInternal would be enough to help our internal use cases.

The public use case that is the motivating example in the first post makes assumptions about the IntPtr that might not hold. Such dictionary would certainly not work in .NET Native as it shipped for UWP apps since the same type can have multiple of these undocumented IntPtr values in there (when compiled with the shared library).

@GrabYourPitchforks
Copy link
Member

As a first pass experiment, I prototyped intrinsicifying Type.TypeHandle (see GrabYourPitchforks@7a6e22e) and saw a consistent constant overhead improvement in benchmarks.

[Benchmark]
[Arguments(2048)]
[Arguments(4096)]
[Arguments(8192)]
[Arguments(16384)]
public byte[] AllocArray(int len)
{
    return GC.AllocateUninitializedArray<byte>(len);
}
Method Job Toolchain len Mean Error StdDev Ratio RatioSD
AllocArray Job-MVWMAI main 2048 64.38 ns 0.337 ns 0.299 ns 1.00 0.00
AllocArray Job-FUDVRJ protomod 2048 61.46 ns 0.405 ns 0.359 ns 0.95 0.01
AllocArray Job-MVWMAI main 4096 71.73 ns 0.818 ns 0.765 ns 1.00 0.00
AllocArray Job-FUDVRJ protomod 4096 69.22 ns 1.345 ns 1.050 ns 0.97 0.02
AllocArray Job-MVWMAI main 8192 80.64 ns 0.495 ns 0.463 ns 1.00 0.00
AllocArray Job-FUDVRJ protomod 8192 78.64 ns 0.838 ns 0.784 ns 0.98 0.01
AllocArray Job-MVWMAI main 16384 103.50 ns 0.556 ns 0.493 ns 1.00 0.00
AllocArray Job-FUDVRJ protomod 16384 102.78 ns 1.025 ns 0.908 ns 0.99 0.01

The prototype commit isn't production-quality, but it gives a first-pass estimate of the potential savings.

@ghost ghost added in-pr There is an active PR which will close this issue when it is merged and removed in-pr There is an active PR which will close this issue when it is merged labels May 5, 2023
@ghost ghost added in-pr There is an active PR which will close this issue when it is merged and removed in-pr There is an active PR which will close this issue when it is merged labels May 5, 2023
@sgf
Copy link

sgf commented Sep 1, 2023

Yesterday I've been working on improving the performance of GC.AllocateUninitializedArray to be able to use it effectively in StringBuilder.

I report a problem. In a certain community in China, someone encountered that the file reading speed of C# is always slower than that of python. Including using new arrays, or using GC.AllocateUninitializedArray to allocate arrays. Finally he found that using Marshal.AllocHGlobal to allocate cache The zone can achieve the same performance as python.
According to his description, when reading a 1.6GB file from ssd,
the built-in read function of C# takes 0.6 seconds (C# (.net6) uses the File.ReadAllBytes method),
while the reading speed of python is 0.4 seconds (python uses BufferedReader's read() method).

I'm not sure if this helps.

This is the link to the problem.

According to its description, the performance is only related to array allocation, and has nothing to do with the API used. When I suggested that he use memory file mapping to read, he said that he tried it, and the speed was also 0.6 seconds. There will be no improvement. This may indicate that when C# internally reads files, the default implementation of buffer allocation wastes 0.2 seconds.

@jkotas
Copy link
Member

jkotas commented Sep 1, 2023

@sgf The impact of this issue on a single buffer allocation is in nanoseconds range. If you are seeing 0.2 seconds delta, the problem is something else. If you are able to reproduce the problem, please open a new issue on it, include details about the .NET and Python versions, OS and hardware.

@sgf
Copy link

sgf commented Sep 1, 2023

@sgf The impact of this issue on a single buffer allocation is in nanoseconds range. If you are seeing 0.2 seconds delta, the problem is something else. If you are able to reproduce the problem, please open a new issue on it, include details about the .NET and Python versions, OS and hardware.

ok,tks

@EgorBo
Copy link
Member

EgorBo commented Oct 5, 2023

I guess this issue can be closed after #85804 (CoreCLR and NativeAOT)

@EgorBo EgorBo closed this as completed Oct 5, 2023
@dotnet dotnet locked as resolved and limited conversation to collaborators Nov 5, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.