Speed up interface checking and casting #49257

benaadams · 2021-03-06T04:49:23Z

16% - 38% faster* for IsInstanceOfInterface & ChkCastInterface

Fewer operands in the addressing modes, [rdx + 8] instead of [rdx + r8*8 + 8], for lower latency and greater CPU port availability, which allows more instruction parallelism (*see "Intel Optimization Reference Manual", sections "3.5.1.2 Using LEA" and "3.5.1.6 Address Calculations"):
- 2-operand addressing has a latency of 1 clock and 2 ports available, so a reciprocal throughput of 0.5 clocks
- 3-operand addressing has a latency of 3 clocks and 1 port available, so a reciprocal throughput of 1 clock

Fewer branches: conditional jumps reduced from 8 to 5.
4 decs exchanged for 1 add, with less data dependency.

-       xor      r8d, r8d
+       cmp      rcx, 4
+       jge      SHORT G_M5541_IG06                     >---\
        align    [0 bytes]                                  |
-                       ;; bbWeight=0.50 PerfScore 3.50     |
+                       ;; bbWeight=0.50 PerfScore 4.00     |
                                                            |
-G_M57878_IG04:                             <---\           |
-       cmp      qword ptr [rdx+8*r8], rsi      |           |
-       je       SHORT G_M57878_IG06            |           |
-       dec      rcx                            |           |
-       je       SHORT G_M57878_IG05            |           |
-       cmp      qword ptr [rdx+8*r8+8], rsi    |           |
-       je       SHORT G_M57878_IG06            |           |
-       dec      rcx                            |           |
-       je       SHORT G_M57878_IG05            |           |
-       cmp      qword ptr [rdx+8*r8+16], rsi   |           |
-       je       SHORT G_M57878_IG06            |           |
-       dec      rcx                            |           |
-       je       SHORT G_M57878_IG05            |           |
-       cmp      qword ptr [rdx+8*r8+24], rsi   |           |
-       je       SHORT G_M57878_IG06            |           |
-       je       SHORT G_M57878_IG05            |           |
-       add      r8, 4                          |           |
-       jmp      SHORT G_M57878_IG04        >---/           |
-                       ;; bbWeight=4    PerfScore 77.00    |
+G_M5541_IG04:                              <---\           |
+       lea      r8, [rdx+32]                   |           |
+       cmp      qword ptr [rdx], rsi           |           |
+       je       SHORT G_M5541_IG08             |           |
+       cmp      qword ptr [rdx+8], rsi         |           |
+       je       SHORT G_M5541_IG08             |           |
+       cmp      qword ptr [rdx+16], rsi        |           |
+       je       SHORT G_M5541_IG08             |           |
+       cmp      qword ptr [rdx+24], rsi        |           |
+       je       SHORT G_M5541_IG08             |           |
+       mov      rdx, r8                        |           |
+       add      rcx, -4                        |           |
+       cmp      rcx, 4                         |           |
+       jge      SHORT G_M5541_IG04         >---/           |
+                       ;; bbWeight=4    PerfScore 57.00    |
+                                                           |
+G_M5541_IG05:                                              |
+       test     rcx, rcx                                   |
+       je       SHORT G_M5541_IG07                         |
+                       ;; bbWeight=0.50 PerfScore 0.62     |
+                                                           |
+G_M5541_IG06:                              <---\       <---/
+       lea      r8, [rdx+8]                    |
+       cmp      qword ptr [rdx], rsi           |
+       je       SHORT G_M5541_IG08             |
+       mov      rdx, r8                        |
        dec      rcx                            |
+       test     rcx, rcx                       |
+       jg       SHORT G_M5541_IG06         >---/
+                       ;; bbWeight=4    PerfScore 21.00
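
The shape of the change in the listing above can be sketched in C. This is only an illustration under assumptions, not the CastHelpers.cs source: `type_handle`, `contains_indexed`, and `contains_bumped` are hypothetical names, and the real helpers do more than a plain membership test. The first function mirrors the old loop (base plus index addressing, one count decrement per element); the second mirrors the new one (plain pointer offsets, one count adjustment per group of four, then a one-at-a-time tail).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a runtime type handle. */
typedef uintptr_t type_handle;

/* Old shape: every load uses base + index*8 + offset, and the remaining
 * count is decremented and tested after each element. */
bool contains_indexed(const type_handle *map, size_t count, type_handle target)
{
    assert(map != NULL || count == 0);
    if (count == 0) return false;
    size_t i = 0;
    for (;;) {
        if (map[i + 0] == target) return true;
        if (--count == 0) return false;
        if (map[i + 1] == target) return true;
        if (--count == 0) return false;
        if (map[i + 2] == target) return true;
        if (--count == 0) return false;
        if (map[i + 3] == target) return true;
        if (--count == 0) return false;
        i += 4;
    }
}

/* New shape: the pointer itself advances, so each load is base + offset,
 * and the count is adjusted once per group of four. */
bool contains_bumped(const type_handle *map, size_t count, type_handle target)
{
    assert(map != NULL || count == 0);
    while (count >= 4) {
        if (map[0] == target) return true;
        if (map[1] == target) return true;
        if (map[2] == target) return true;
        if (map[3] == target) return true;
        map += 4;
        count -= 4;
    }
    /* Tail loop for the remaining 0 to 3 entries. */
    while (count > 0) {
        if (map[0] == target) return true;
        map += 1;
        count -= 1;
    }
    return false;
}
```

Both functions return the same answer for any input; only the addressing and the number of count updates per iteration differ.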

EgorBo · 2021-03-06T10:06:52Z

@AndyAyersMS btw, the CastHelpers (also SpanHelpers) are heavily optimized by hand for general cases - I wonder if we should disable PGO for them or even mark them as AggressiveOptimization - otherwise we can re-order them because of some loop during startup and mess up other cases.

benaadams · 2021-03-06T19:08:07Z

Updated summary

EgorBo · 2021-03-06T19:58:41Z

Can you run a benchmark, e.g. something like this (for IsInstanceOfClass): https://gist.github.com/EgorBo/8766a778fa2522fad4e916c567a2b5d7

benaadams · 2021-03-07T02:23:12Z

> Can you run a benchmark

Have one, but it's been running for hours (too many variants); will have to try something shorter

benaadams · 2021-03-07T05:26:26Z

> Can you run a benchmark

Looks good after fixing the condition @jkotas pointed out #49257 (comment)

Interface cast benchmark gist

Method	Implementation	Mean	branch	Ratio
IsInterface1	Interfaces1	2.166 ns	current	1.000
IsInterface1	Interfaces1	1.629 ns	PR	0.752
IsNotImplemented	Interfaces1	2.166 ns	current	1.000
IsNotImplemented	Interfaces1	1.950 ns	PR	0.900

IsInterface1	Interfaces6	2.155 ns	current	1.000
IsInterface1	Interfaces6	1.936 ns	PR	0.898
IsInterface2	Interfaces6	2.161 ns	current	1.000
IsInterface2	Interfaces6	1.908 ns	PR	0.883
IsInterface3	Interfaces6	2.188 ns	current	1.000
IsInterface3	Interfaces6	1.672 ns	PR	0.764
IsInterface4	Interfaces6	2.346 ns	current	1.000
IsInterface4	Interfaces6	1.965 ns	PR	0.838
IsInterface5	Interfaces6	3.122 ns	current	1.000
IsInterface5	Interfaces6	2.173 ns	PR	0.696
IsInterface6	Interfaces6	3.472 ns	current	1.000
IsInterface6	Interfaces6	2.687 ns	PR	0.774
IsNotImplemented	Interfaces6	3.644 ns	current	1.000
IsNotImplemented	Interfaces6	3.071 ns	PR	0.843

IsInterface1	Interfaces16	2.154 ns	current	1.000
IsInterface1	Interfaces16	1.958 ns	PR	0.909
IsInterface2	Interfaces16	1.938 ns	current	1.000
IsInterface2	Interfaces16	1.950 ns	PR	1.006
IsInterface3	Interfaces16	2.185 ns	current	1.000
IsInterface3	Interfaces16	1.661 ns	PR	0.760
IsInterface4	Interfaces16	2.326 ns	current	1.000
IsInterface4	Interfaces16	1.655 ns	PR	0.712
IsInterface5	Interfaces16	3.201 ns	current	1.000
IsInterface5	Interfaces16	2.349 ns	PR	0.734
IsInterface6	Interfaces16	3.186 ns	current	1.000
IsInterface6	Interfaces16	2.376 ns	PR	0.746
IsInterface7	Interfaces16	3.360 ns	current	1.000
IsInterface7	Interfaces16	2.578 ns	PR	0.767
IsInterface8	Interfaces16	3.764 ns	current	1.000
IsInterface8	Interfaces16	2.621 ns	PR	0.696
IsInterface9	Interfaces16	4.291 ns	current	1.000
IsInterface9	Interfaces16	3.091 ns	PR	0.720
IsInterface10	Interfaces16	4.866 ns	current	1.000
IsInterface10	Interfaces16	3.021 ns	PR	0.621
IsInterface11	Interfaces16	4.600 ns	current	1.000
IsInterface11	Interfaces16	3.439 ns	PR	0.748
IsInterface12	Interfaces16	4.908 ns	current	1.000
IsInterface12	Interfaces16	3.376 ns	PR	0.688
IsInterface13	Interfaces16	5.862 ns	current	1.000
IsInterface13	Interfaces16	3.720 ns	PR	0.635
IsInterface14	Interfaces16	6.111 ns	current	1.000
IsInterface14	Interfaces16	3.682 ns	PR	0.603
IsInterface15	Interfaces16	5.902 ns	current	1.000
IsInterface15	Interfaces16	3.955 ns	PR	0.670
IsInterface16	Interfaces16	6.406 ns	current	1.000
IsInterface16	Interfaces16	3.947 ns	PR	0.616
IsNotImplemented	Interfaces16	6.689 ns	current	1.000
IsNotImplemented	Interfaces16	5.412 ns	PR	0.809
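
As a reading aid for the Ratio column, here is a small C sketch of the arithmetic. The function names are hypothetical and are not part of the benchmark gist; the sketch assumes Ratio is the PR mean divided by the current mean, which matches the rows above.

```c
/* Ratio as shown in the table: PR time relative to current time. */
double time_ratio(double current_ns, double pr_ns)
{
    return pr_ns / current_ns;
}

/* Share of the original time that the PR saves, in percent. */
double percent_time_saved(double current_ns, double pr_ns)
{
    return (1.0 - pr_ns / current_ns) * 100.0;
}

/* How many times more operations fit into the same time. */
double speedup_factor(double current_ns, double pr_ns)
{
    return current_ns / pr_ns;
}
```

For the IsInterface10 / Interfaces16 row (4.866 ns current, 3.021 ns PR), this gives a ratio of about 0.621, roughly 38% less time per call, or a speedup factor of about 1.61.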

jkotas

Looks great to me. @VSadov Could you please take a look as well?

MaherJendoubi · 2021-03-07T17:34:13Z

You can delete the unnecessary usings in the System.Runtime.CompilerServices.CastHelpers class:
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

benaadams · 2021-03-07T17:46:01Z

> You can delete unnecessary usings ...

Done

VSadov · 2021-03-07T18:03:06Z

src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastHelpers.cs

-                    if (interfaceMap[i + 3] == toTypeHnd)
+                    // If not enough for unrolled, jmp straight to small loop
+                    // as we already know there is one or more interfaces so don't need to check again.
+                    goto few;


If the interface table were allocated 4-element aligned and zero-filled, we could remove the "few" case entirely, at least in the Chk case, which is expected to succeed.
That would come with a size increase though, so it does not seem worth it.
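
A minimal C sketch of that idea, under stated assumptions: the table length is rounded up to a multiple of four, the unused slots hold zero, and zero is never a valid handle to search for. `type_handle`, `padded_length`, and `contains_padded` are hypothetical names used only for illustration; this is not how the runtime allocates interface maps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a runtime type handle. */
typedef uintptr_t type_handle;

/* Round an entry count up to the next multiple of four. */
size_t padded_length(size_t count)
{
    return (count + 3) & ~(size_t)3;
}

/* With zero-filled padding, only the unrolled loop is needed: the padding
 * slots can never match a non-zero target, so no tail loop is required. */
bool contains_padded(const type_handle *map, size_t padded_count, type_handle target)
{
    assert(padded_count % 4 == 0);
    assert(target != 0);
    while (padded_count != 0) {
        if (map[0] == target || map[1] == target ||
            map[2] == target || map[3] == target)
            return true;
        map += 4;
        padded_count -= 4;
    }
    return false;
}
```

The cost is visible in `padded_length`: a table with 5 entries would occupy 8 slots, which is the size increase mentioned above.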

VSadov

LGTM, Nice!!!

AndyAyersMS · 2021-03-07T23:57:48Z

> the CastHelpers (also SpanHelpers) are heavily optimized by hand for general cases - I wonder if we should disable PGO for them or even mark as AggressiveOptimization

These methods get prejitted, right? Once we have PGO data for prejitted methods we should look and see if we're happy with the codegen we get.

VSadov · 2021-03-08T02:54:39Z

@AndyAyersMS - Yes, these methods are R2R (so no jit at startup) and may rejit later.

They are not AggressiveOptimization. We know that will make them faster by avoiding tiering indirection, but we will have to jit at startup and we want to avoid that.

dotnet-issue-labeler bot added the area-VM-coreclr label Mar 6, 2021

EgorBo reviewed Mar 6, 2021

EgorBo reviewed Mar 6, 2021

Reduce branches in IsInstanceOfInterface/ChkCastInterface

855430a

benaadams force-pushed the IsInstanceOfInterface branch from fb8a3de to 855430a Compare March 6, 2021 18:59

Drop extra var, lea; additional check for small counts

6617221

jkotas reviewed Mar 7, 2021

jkotas reviewed Mar 7, 2021

Feedback

8670404

benaadams marked this pull request as ready for review March 7, 2021 05:26

benaadams changed the title ~~Reduce branches in IsInstanceOfInterface/ChkCastInterface~~ Speed up interface checking and casting Mar 7, 2021

jkotas reviewed Mar 7, 2021

Undo IsInstanceOfClass change

ef16c9b

benaadams requested review from jkotas and EgorBo March 7, 2021 06:28

jkotas requested a review from VSadov March 7, 2021 15:50

jkotas approved these changes Mar 7, 2021

Tidy usings

c60450e

VSadov reviewed Mar 7, 2021

VSadov approved these changes Mar 7, 2021

VSadov merged commit bea0abb into dotnet:main Mar 7, 2021

benaadams deleted the IsInstanceOfInterface branch March 7, 2021 19:54

DrewScoggins mentioned this pull request Mar 11, 2021

[Perf] Changes at 3/7/2021 7:54:19 PM DrewScoggins/performance-2#4224

Open

runfoapp bot mentioned this pull request Mar 12, 2021

[tests] System.Text.Json.Tests segfault, for Libraries Test Run release coreclr OSX x64 Release #47805

Closed

This was referenced Mar 22, 2021

Native Asset failure while System.Collections.Concurrent.Tests #48614

Closed

RunContinueWithStressTestsNoState timing out in CI #2271

Closed

System.Collections.Concurrent.Tests crashing in CI #45517

Closed

benaadams mentioned this pull request Mar 25, 2021

What's new in .NET 6 Preview 3 dotnet/core#5890

Closed

ghost locked as resolved and limited conversation to collaborators Apr 7, 2021

karelz added this to the 6.0.0 milestone May 20, 2021
