JIT: investigate EH Write Through Failures #35534

Closed
AndyAyersMS opened this issue Apr 27, 2020 · 34 comments

@AndyAyersMS
Member

We enable EH write-through on the jit-experimental pipeline, which runs every weekend, and there are a good number of test failures:

https://dev.azure.com/dnceng/public/_build/results?buildId=618423&view=ms.vss-test-web.build-test-results-tab

@AndyAyersMS AndyAyersMS added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 27, 2020
@AndyAyersMS AndyAyersMS added this to the 5.0 milestone Apr 27, 2020
@AndyAyersMS AndyAyersMS self-assigned this Apr 27, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Apr 27, 2020
@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Apr 28, 2020
@AndyAyersMS
Member Author

Tests show the following issues:

  • runtime failures, eg JIT\opt\Devirtualization\box2.
  • Assertion failed 'regRecord->assignedInterval == nullptr', eg JIT\Intrinsics\TypeIntrinsics_r\TypeIntrinsics_r

I think I have a fix for some of the runtime failures. It looks like we are not initializing the stack home for an EH-live parameter.

@CarolEidt
Contributor

@AndyAyersMS - let me know if you'd like me to track down the LSRA assertions.
And thanks for tracking down the runtime failures!

@AndyAyersMS
Member Author

@CarolEidt I'll keep looking, but maybe you can give me some pointers.

The assert is in verifyFinalAllocation: there is a RefTypeKill (RAX killed by a helper call) on a register that still has an assigned interval:

case RefTypeKill:
    assert(regRecord != nullptr);
    assert(regRecord->assignedInterval == nullptr);
    dumpLsraAllocationEvent(LSRA_EVENT_KEPT_ALLOCATION, nullptr, regRecord->regNum, currentBlock);
    break;

<RefPosition #18  @14  RefTypeKill <Reg:rax> BB01 regmask=[rax] minReg=1 last>

@AndyAyersMS
Member Author

AndyAyersMS commented Apr 28, 2020

The associated interval is a write-through local:

;  V08 loc6              ref  EH must-init class-hnd EH-live
Interval  8: (V08) ref (SPILLED) (writeThru) RefPositions {#2@0 #3@0 #754@705 #1006@962 #1036@1009} physReg:NA Preferences=[rax]

@CarolEidt
Contributor

I would guess that V08 didn't get a register at entry, or was allocated rax and then later spilled, but was not marked spilled. I'm not sure why there are two RefPositions at location zero (i.e. entry). The entry RefPositions (which are usually either RefTypeParamDef or RefTypeZeroInit, but not both) get special handling, so perhaps something is missing there.

@AndyAyersMS
Member Author

For the two ref positions:

V08 was live in to first block: creating ZeroInit
<RefPosition #2   @0   RefTypeZeroInit <Ivl:8 V08> IL_OFFSET BB01 regmask=[allIntButFP] minReg=1>
V08 is a finally var: creating ZeroInit
<RefPosition #3   @0   RefTypeZeroInit <Ivl:8 V08> IL_OFFSET BB01 regmask=[allIntButFP] minReg=1>

I think V08's liveness is over-stated but perhaps the use (in a return) is leaking upwards through infeasible EH paths.

First bit of the allocation table:

-----------------------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+
Loc  RP#   Name  Type  Action Reg  |rax  |rcx  |rdx  |rbx  |rbp  |rsi  |rdi  |r8   |r9   |
-----------------------------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+
                                   |     |V0  a|V1  a|     |     |     |     |     |     |
   0.#0    V1    Parm   Alloc rsi  |     |V0  a|     |     |     |V1  a|     |     |     |
   0.#1    V0    Parm   Alloc rdi  |     |     |     |     |     |V1  a|V0  a|     |     |
   0.#2    V8    Zero   Alloc rax  |V8  a|     |     |     |     |V1  a|V0  a|     |     |
   0.#3    V8    Zero   Keep  rax  |V8  a|     |     |     |     |V1  a|V0  a|     |     |
   1.#4    BB1  PredBB0            |V8  a|     |     |     |     |V1  a|V0  a|     |     |
   7.#5    rdx   Fixd   Keep  rdx  |V8  a|     |     |     |     |V1  a|V0  a|     |     |
   7.#6    V1    Use    Copy  rdx  |V8  a|     |V1  a|     |     |V1  a|V0  a|     |     |
   8.#7    rdx   Fixd   Keep  rdx  |V8  a|     |V1  a|     |     |V1  a|V0  a|     |     |
   8.#8    I35   Def    Alloc rdx  |V8  a|     |I35 a|     |     |V1  a|V0  a|     |     |
  10.#9    C36   Def    Alloc rcx  |V8  a|C36 a|I35 a|     |     |V1  a|V0  a|     |     |
  11.#10   rcx   Fixd   Keep  rcx  |V8  a|C36 a|I35 a|     |     |V1  a|V0  a|     |     |
  11.#11   C36   Use *  Keep  rcx  |V8  a|C36 a|I35 a|     |     |V1  a|V0  a|     |     |
  12.#12   rcx   Fixd   Keep  rcx  |V8  a|     |I35 a|     |     |V1  a|V0  a|     |     |
  12.#13   I37   Def    Alloc rcx  |V8  a|I37 a|I35 a|     |     |V1  a|V0  a|     |     |
  13.#14   rdx   Fixd   Keep  rdx  |V8  a|I37 a|I35 a|     |     |V1  a|V0  a|     |     |
  13.#15   I35   Use *  Keep  rdx  |V8  a|I37 a|I35 a|     |     |V1  a|V0  a|     |     |
  13.#16   rcx   Fixd   Keep  rcx  |V8  a|I37 a|I35 a|     |     |V1  a|V0  a|     |     |
  13.#17   I37   Use *  Keep  rcx  |V8  a|I37 a|I35 a|     |     |V1  a|V0  a|     |     |
  14.#18   rax   Kill   Spill rax  |     |     |     |     |     |V1  a|V0  a|     |     |
                        Keep  rax  |     |     |     |     |     |V1  a|V0  a|     |     |

Method being jitted:

// If the immediate child is another scope, merge it into this one
// This is an optimization to save environment allocations and
// array accesses.
private ReadOnlyCollection<Expression> MergeScopes(Expression node)
{
    ReadOnlyCollection<Expression> body;
    var lambda = node as LambdaExpression;
    if (lambda != null)
    {
        body = new ReadOnlyCollection<Expression>(new[] { lambda.Body });
    }
    else
    {
        body = ((BlockExpression)node).Expressions;
    }
    CompilerScope currentScope = _scopes.Peek();
    // A block body is mergeable if the body only contains one single block node containing variables,
    // and the child block has the same type as the parent block.
    while (body.Count == 1 && body[0].NodeType == ExpressionType.Block)
    {
        var block = (BlockExpression)body[0];
        if (block.Variables.Count > 0)
        {
            // Make sure none of the variables are shadowed. If any
            // are, we can't merge it.
            foreach (ParameterExpression v in block.Variables)
            {
                if (currentScope.Definitions.ContainsKey(v))
                {
                    return body;
                }
            }
            // Otherwise, merge it
            if (currentScope.MergedScopes == null)
            {
                currentScope.MergedScopes = new HashSet<BlockExpression>(ReferenceEqualityComparer.Instance);
            }
            currentScope.MergedScopes.Add(block);
            foreach (ParameterExpression v in block.Variables)
            {
                currentScope.Definitions.Add(v, VariableStorageKind.Local);
            }
        }
        body = block.Expressions;
    }
    return body;
}

V08 (local6) is defined/used by that inner return.

@AndyAyersMS
Member Author

I'm also looking for simpler failing test cases or a simple repro based on the above, but no luck so far.

@CarolEidt
Contributor

I was able to reproduce this using SuperPmi. It is indeed the double zero-init that is causing the problem. I submitted PR #35585 to fix it.
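To make the double zero-init concrete, here is a minimal sketch of a guard that creates at most one entry ZeroInit per interval. It is a toy model, not the change made in PR #35585: `ToyRefPosition`, `EntryRefBuilder`, and `addZeroInitOnce` are hypothetical names invented for illustration, and the real LSRA data structures are much richer.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy stand-ins for LSRA concepts; all names here are hypothetical.
enum class ToyRefType { ZeroInit, ParamDef };

struct ToyRefPosition {
    int        intervalIndex; // which local's interval this belongs to
    int        location;      // 0 means method entry
    ToyRefType type;
};

class EntryRefBuilder {
public:
    // Creates a ZeroInit at location 0 unless this interval already has one.
    // Returns true only when a new RefPosition was actually created.
    bool addZeroInitOnce(int intervalIndex) {
        if (!seen_.insert(intervalIndex).second) {
            return false; // already zero-initialized at entry; skip the duplicate
        }
        refs_.push_back({intervalIndex, 0, ToyRefType::ZeroInit});
        return true;
    }

    const std::vector<ToyRefPosition>& refs() const { return refs_; }

private:
    std::set<int>               seen_; // intervals that already have an entry ZeroInit
    std::vector<ToyRefPosition> refs_;
};
```

In this model, calling `addZeroInitOnce(8)` once for "live in to first block" and again for "is a finally var" leaves a single entry RefPosition for interval 8 instead of the pair #2 and #3 seen in the dump above.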

@AndyAyersMS
Member Author

Still ~34 failures (slightly fewer) with #35585. They all look like unexpected null reference exceptions. I'll look for a simple case and try to see what's up.

@AndyAyersMS
Member Author

A number of the remaining bugs involve WaitForExitCore -- failure is an AV in the callee GetProcessHandle because we call it with null in RCX.

Looks like either there's a missing reload of RCX before the call, or else the xor of RCX at the end of the zeroing done in the jit prolog inadvertently trashes RCX which we expected we could keep live.

Will keep digging.

;; System.Diagnostics.Process:WaitForExitCore(int):bool:this
G_M64267_IG01:
       55                   push     rbp
       4156                 push     r14
       57                   push     rdi
       56                   push     rsi
       53                   push     rbx
       4883EC40             sub      rsp, 64
       488D6C2460           lea      rbp, [rsp+60H]
       33C0                 xor      rax, rax
       488945D8             mov      qword ptr [rbp-28H], rax
       488965C0             mov      qword ptr [rbp-40H], rsp
       48894D10             mov      gword ptr [rbp+10H], rcx
       895518               mov      dword ptr [rbp+18H], edx
       33C9                 xor      rcx, rcx
						;; bbWeight=1    PerfScore 10.25
G_M64267_IG02:
       4533C0               xor      r8, r8
       33C0                 xor      rax, rax
       488945D8             mov      gword ptr [rbp-28H], rax
						;; bbWeight=1    PerfScore 1.50
G_M64267_IG03:

;  **** need to reload RCX here, or not zero it above ***

       BA00001000           mov      edx, 0x100000
       4533C0               xor      r8d, r8d
       E8081BFEFF           call     System.Diagnostics.Process:GetProcessHandle(int,bool):Microsoft.Win32.SafeHandles.SafeProcessHandle:this

@AndyAyersMS
Member Author

Looks to me like RCX is getting trashed. The troublemaker is

;  V02 loc0         [V02,T02] ( 11,  4.50)     ref  ->  [rbp-0x28]   EH must-init class-hnd EH-live

which is assigned RCX for a stretch down in the try body. We zero both its memory and its register locations in the jit prolog; unfortunately the register location isn't live there.

Relevant logic is here:

/* For lvMustInit vars, gather pertinent info */
if (!varDsc->lvMustInit)
{
    continue;
}
bool isInReg = varDsc->lvIsInReg();
bool isInMemory = !isInReg || varDsc->lvLiveInOutOfHndlr;
if (isInReg)
{

Seems plausible that if a variable is must init and in both memory and a register we only need to zero memory in the prolog, but that's probably too simplistic.

@CarolEidt thoughts?

@CarolEidt
Contributor

It seems that the register allocator thinks that the variable is in RCX at procedure entry, so that seems to be the source of the issue. I did a little searching and I didn't find any clear indication of what might be going wrong. If you have a jitdump you can search for:

Recording Var Locations at start of BB01

Which will show the variables that it believes are live in registers at the start of the block. It shows up twice: once when we are generating code for that block, and once just prior to:

*************** In genFnProlog()

Let me know if you'd like me to take over tracking this down.

@AndyAyersMS
Member Author

Only V00 (this) is live in BB01:

Recording Var Locations at start of BB01
  V00(rcx)
*************** In genFnProlog()

V02 becomes live in BB01, but at that point it's in RAX. It is only live in RCX later in the method -- that is, its highest refpos appearances are in RCX. I wonder if this carries over to the prolog codegen and that's why we think we need to zero RCX.

For a variable that lives in different registers in different parts of the code, what's the intended meaning of varDsc->GetRegNum() ?

@CarolEidt
Contributor

> For a variable that lives in different registers in different parts of the code, what's the intended meaning of varDsc->GetRegNum() ?

It is the "current" register occupied by the variable. Unfortunately, I believe that it relies on the invariant that you only query it if the variable is actually live. Otherwise the register allocator would have to reset the register number for all the variables at each boundary, not just those that are live. In this case V02 isn't live, but its last register was apparently RCX. It seems that this would be an issue for any lvMustInit variable that was in a register at the end of the method, so I'm not sure why this isn't a problem without EHWriteThru.

@AndyAyersMS
Member Author

If you want to investigate, here's a simple repro:

using System;
using System.Diagnostics;

class X
{
    public static int Main()
    {
        var process = new Process {
            StartInfo = new ProcessStartInfo {
                FileName = "notepad.exe"
            }
        };
        process.Start();
        process.WaitForExit();
        return 100;
    }
}

Key method is WaitForExitCore.

@CarolEidt
Contributor

Thanks, I'll take a look.

@CarolEidt
Contributor

So ... without EHWriteThru we never have a lvMustInit variable that's not live-in to the entry block, but with EHWriteThru we set lvMustInit on all the variables that are live-in to a finally block. Here's a fix that addresses this case, and adds an assert that in all other cases it must be live-in: #35723
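The effect of a stale register number on prolog zeroing can be sketched with a toy model. This is not the code in #35723: `ToyLocal`, `PrologPlan`, `planProlog`, and the `liveInAtEntry` flag are hypothetical and exist only to illustrate the idea. The `isInReg`/`isInMemory` computation mirrors the snippet quoted earlier in this thread; the live-in guard is an assumption about one way to express the condition.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy register numbers; the values are illustrative, not the JIT's encoding.
enum ToyReg : unsigned { TOY_RAX = 0, TOY_RCX = 1, TOY_RDX = 2 };

// Hypothetical, simplified stand-in for a local variable descriptor.
struct ToyLocal {
    bool   mustInit;           // plays the role of lvMustInit
    bool   hasReg;             // plays the role of lvIsInReg()
    ToyReg lastReg;            // last register assigned; stale if the local is not live
    bool   liveInOutOfHandler; // plays the role of lvLiveInOutOfHndlr
    bool   liveInAtEntry;      // assumption: liveness at the start of the first block
};

struct PrologPlan {
    std::uint32_t regMaskToZero    = 0; // one bit per register the prolog will zero
    int           stackSlotsToZero = 0; // number of stack homes the prolog will zero
};

// With requireLiveIn == false the plan trusts lastReg unconditionally (the failing
// behavior); with requireLiveIn == true a register is zeroed only for live-in locals.
PrologPlan planProlog(const std::vector<ToyLocal>& locals, bool requireLiveIn) {
    PrologPlan plan;
    for (const ToyLocal& lcl : locals) {
        if (!lcl.mustInit) {
            continue;
        }
        bool isInReg    = lcl.hasReg && (!requireLiveIn || lcl.liveInAtEntry);
        bool isInMemory = !isInReg || lcl.liveInOutOfHandler;
        if (isInReg) {
            plan.regMaskToZero |= (1u << lcl.lastReg);
        }
        if (isInMemory) {
            plan.stackSlotsToZero++;
        }
    }
    return plan;
}
```

For a local shaped like V02 in WaitForExitCore (must-init, EH-live, last seen in RCX, not live-in at entry), the unguarded plan includes RCX in the zeroing mask, which would clobber the incoming `this` argument, while the guarded plan zeroes only the stack home.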

@AndyAyersMS
Member Author

Great -- this fixes my local repro case. I'll add the jit-experimental testing to your new PR.

This may be the last issue for x64 Pri1 tests.

@AndyAyersMS
Member Author

AndyAyersMS commented May 3, 2020

Now that all the fixes are in, I launched a few ASP.NET perf runs.

[EDIT] Baseline data here is stale, so improvements are not accurate -- see comments below

On Json for windows x64, I am seeing a 7% improvement in RPS and 10% improvement in latency.

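| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | Build Time (ms) | Published Size (KB) | First Request (ms) | Latency (ms) | Errors | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 329,134 | 75 | 168 | 2.47 | 466 | 4551 | 106554 | 51.12 | 0.52 | 0 | 1.00 |
| EHWT | 350,934 | 78 | 163 | 2.21 | 458 | 4053 | 106808 | 57.89 | 0.54 | 0 | 1.07 |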

On ResponseCachingPlaintextCached for windows x64, 4% improvement on RPS, 10% on latency.

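| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | Build Time (ms) | Published Size (KB) | First Request (ms) | Latency (ms) | Errors | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 900,719 | 95 | 168 | 3.44 | 459 | 4540 | 106554 | 53.86 | 0.55 | 0 | 1.00 |
| EHWT | 936,250 | 96 | 168 | 3.09 | 460 | 4056 | 106808 | 61.25 | 0.48 | 0 | 1.04 |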

@sebastienros can you help us do more comprehensive testing? To enable this you need a build from Saturday or later, and need to set

COMPlus_EnableEHWriteThru=1

@sebastienros
Member

What OSes and architectures should it be tested on?

@AndyAyersMS
Member Author

It should improve codegen for all OSes/Architectures.

@sebastienros
Member

Preliminary results on 12 core machines

| Scenario | ARCH | OS | ENV | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | Build Time (ms) | Published Size (KB) | First Request (ms) | Latency (ms) | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PlaintextPlatform | INTEL | Windows | Baseline | 4,974,135 | 74 | 53 | 2.75 | 266 | 3867 | 87990 | 31.77 | 0.45 | 0 |
| PlaintextPlatform | INTEL | Windows | EHWT | 4,965,548 | 71 | 54 | 2.63 | 272 | 3545 | 87990 | 32.88 | 0.49 | 0 |
| Plaintext | INTEL | Windows | Baseline | 2,309,179 | 85 | 58 | 1.93 | 530 | 4212 | 106817 | 41.3 | 0.48 | 0 |
| Plaintext | INTEL | Windows | EHWT | 2,287,358 | 85 | 58 | 2.12 | 524 | 4044 | 106817 | 41.67 | 0.41 | 0 |
| MvcPlaintext | INTEL | Windows | Baseline | 781,889 | 93 | 182 | 4.78 | 585 | 4031 | 106817 | 72.85 | 0.44 | 0 |
| MvcPlaintext | INTEL | Windows | EHWT | 771,005 | 93 | 180 | 4.61 | 577 | 4047 | 106817 | 72.31 | 0.37 | 0 |
| Json | INTEL | Windows | Baseline | 328,910 | 74 | 164 | 2.35 | 516 | 4041 | 106817 | 53.85 | 0.39 | 0 |
| Json | INTEL | Windows | EHWT | 330,200 | 75 | 165 | 2.19 | 510 | 4037 | 106817 | 52.58 | 0.44 | 0 |
| MvcJson | INTEL | Windows | Baseline | 162,222 | 88 | 179 | 3.44 | 577 | 4040 | 106817 | 82.17 | 0.48 | 0 |
| MvcJson | INTEL | Windows | EHWT | 163,071 | 84 | 180 | 3.18 | 576 | 4034 | 106817 | 87.4 | 0.68 | 0 |
| DbFortunesRaw | INTEL | Windows | Baseline | 115,521 | 88 | 214 | 2.63 | 512 | 4035 | 106817 | 389.58 | 1.07 | 0 |
| DbFortunesRaw | INTEL | Windows | EHWT | 115,419 | 88 | 207 | 2.6 | 519 | 4028 | 106817 | 387.23 | 1.11 | 0 |
| MvcDbFortunesEf | INTEL | Windows | Baseline | 42,030 | 88 | 333 | 6.85 | 640 | 4029 | 106817 | 921.32 | 1.3 | 0 |
| MvcDbFortunesEf | INTEL | Windows | EHWT | 41,798 | 89 | 325 | 7.09 | 644 | 4038 | 106817 | 920.97 | 1.35 | 0 |
| PlaintextPlatform | INTEL | Linux | Baseline | 5,115,384 | 99 | 69 | 2.08 | 224 | 4502 | 102118 | 49.33 | 0.39 | 0 |
| PlaintextPlatform | INTEL | Linux | EHWT | 5,176,576 | 99 | 70 | 2.39 | 231 | 4002 | 102118 | 52.28 | 0.34 | 0 |
| Plaintext | INTEL | Linux | Baseline | 2,084,199 | 99 | 83 | 1.33 | 381 | 4335 | 120698 | 54.46 | 0.4 | 0 |
| Plaintext | INTEL | Linux | EHWT | 2,083,963 | 99 | 82 | 1.35 | 372 | 4002 | 120698 | 59.62 | 0.51 | 0 |
| MvcPlaintext | INTEL | Linux | Baseline | 723,120 | 97 | 205 | 3.27 | 395 | 4002 | 120698 | 101.96 | 0.46 | 0 |
| MvcPlaintext | INTEL | Linux | EHWT | 738,900 | 97 | 204 | 3.19 | 414 | 4002 | 120698 | 93.82 | 0.52 | 0 |
| Json | INTEL | Linux | Baseline | 360,628 | 99 | 191 | 1.13 | 375 | 4002 | 120698 | 68.11 | 0.49 | 0 |
| Json | INTEL | Linux | EHWT | 364,641 | 99 | 191 | 1.09 | 381 | 4002 | 120698 | 67.28 | 0.45 | 0 |
| MvcJson | INTEL | Linux | Baseline | 177,961 | 98 | 200 | 1.59 | 392 | 4002 | 120698 | 105.27 | 0.74 | 0 |
| MvcJson | INTEL | Linux | EHWT | 176,572 | 98 | 201 | 1.67 | 415 | 4002 | 120698 | 110.76 | 0.72 | 0 |
| DbFortunesRaw | INTEL | Linux | Baseline | 105,510 | 97 | 234 | 2.7 | 372 | 4002 | 120698 | 431.16 | 1.24 | 0 |
| DbFortunesRaw | INTEL | Linux | EHWT | 105,452 | 97 | 235 | 2.7 | 385 | 4002 | 120698 | 426.17 | 1.28 | 0 |
| MvcDbFortunesEf | INTEL | Linux | Baseline | 44,683 | 96 | 254 | 5.82 | 471 | 4002 | 120698 | 937.98 | 1.56 | 0 |
| MvcDbFortunesEf | INTEL | Linux | EHWT | 44,849 | 97 | 259 | 5.79 | 481 | 4002 | 120698 | 937.98 | 1.59 | 0 |

@AndyAyersMS
Member Author

Sigh, I had a bug in my script and was comparing new runs to older baseline files.

Results vs proper baselines are more in line with the data Sebastien has gathered. Odd though that Json RPS was 350K yesterday (for both baseline and EHWT) and only 330K today.

@sebastienros
Member

Adding more numbers, this time on the citrine environment. INTEL machines are 14/28(ht) cores, ARM is 32 cores.

| Scenario | ARCH | OS | ENV | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | Build Time (ms) | Published Size (KB) | First Request (ms) | Latency (ms) | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PlaintextPlatform | ARM | Linux | Baseline | 5,532,785 | 98 | 75 | 1.04 | 502 | 10005 | 115711 | 119.75 | 0.46 | 0 |
| PlaintextPlatform | ARM | Linux | EHWT | 5,686,295 | 98 | 73 | 0.91 | 509 | 10006 | 115711 | 114.75 | 0.18 | 0 |
| Plaintext | ARM | Linux | Baseline | 2,083,741 | 99 | 90 | 1.54 | 968 | 12257 | 134302 | 129.02 | 0.34 | 0 |
| Plaintext | ARM | Linux | EHWT | 2,074,222 | 99 | 91 | 1.77 | 952 | 12006 | 134302 | 137.13 | 0.29 | 0 |
| MvcPlaintext | ARM | Linux | Baseline | 289,407 | 93 | 121 | 10.09 | 1096 | 12006 | 134302 | 237.66 | 1.13 | 0 |
| MvcPlaintext | ARM | Linux | EHWT | 296,528 | 92 | 118 | 9.72 | 1112 | 12257 | 134302 | 233.25 | 0.57 | 0 |
| Json | ARM | Linux | Baseline | 351,495 | 98 | 126 | 1.48 | 938 | 12006 | 134302 | 163.35 | 0.59 | 0 |
| Json | ARM | Linux | EHWT | 355,377 | 98 | 125 | 1.52 | 955 | 12007 | 134302 | 159.38 | 0.55 | 0 |
| MvcJson | ARM | Linux | Baseline | 105,397 | 96 | 120 | 3.5 | 1088 | 12007 | 134302 | 258.84 | 0.82 | 0 |
| MvcJson | ARM | Linux | EHWT | 111,171 | 96 | 125 | 3.37 | 1082 | 12007 | 134302 | 268.7 | 0.68 | 0 |
| FortunesPlatform | ARM | Linux | Baseline | 82,966 | 95 | 167 | 4.16 | 516 | 12507 | 115713 | 1278.05 | 0.8 | 0 |
| FortunesPlatform | ARM | Linux | EHWT | 82,249 | 94 | 196 | 4.18 | 498 | 10005 | 115713 | 1303.77 | 1.29 | 0 |
| DbFortunesRaw | ARM | Linux | Baseline | 60,219 | 96 | 182 | 5.16 | 930 | 12007 | 134302 | 1187.43 | 2.23 | 0 |
| DbFortunesRaw | ARM | Linux | EHWT | 59,997 | 95 | 190 | 5.2 | 955 | 12006 | 134302 | 1211.3 | 1.35 | 0 |
| MvcDbFortunesEf | ARM | Linux | Baseline | 27,915 | 95 | 282 | 10.66 | 1254 | 11757 | 134302 | 2890.65 | 2.75 | 0 |
| MvcDbFortunesEf | ARM | Linux | EHWT | 27,969 | 96 | 300 | 10.73 | 1223 | 12007 | 134302 | 2893.05 | 3.47 | 0 |
| PlaintextPlatform | INTEL | Windows | Baseline | 7,949,569 | 82 | 59 | 0.45 | 285 | 5536 | 87990 | 35.75 | 0.11 | 0 |
| PlaintextPlatform | INTEL | Windows | EHWT | 7,766,029 | 82 | 58 | 0.4 | 310 | 4784 | 87990 | 36.08 | 0.11 | 0 |
| Plaintext | INTEL | Windows | Baseline | 5,296,659 | 92 | 63 | 2.46 | 565 | 7031 | 106818 | 45.43 | 0.15 | 0 |
| Plaintext | INTEL | Windows | EHWT | 5,396,604 | 88 | 62 | 2.23 | 547 | 4527 | 106818 | 45.41 | 0.1 | 0 |
| MvcPlaintext | INTEL | Windows | Baseline | 2,322,387 | 98 | 414 | 4.27 | 621 | 4544 | 106818 | 82.96 | 0.15 | 0 |
| MvcPlaintext | INTEL | Windows | EHWT | 2,336,585 | 98 | 412 | 5.28 | 603 | 4532 | 106818 | 83.08 | 0.13 | 0 |
| Json | INTEL | Windows | Baseline | 775,633 | 80 | 390 | 0.45 | 548 | 4532 | 106818 | 59.57 | 0.13 | 0 |
| Json | INTEL | Windows | EHWT | 760,177 | 82 | 391 | 0.46 | 576 | 4525 | 106818 | 59.87 | 0.12 | 0 |
| MvcJson | INTEL | Windows | Baseline | 514,506 | 92 | 404 | 4.43 | 618 | 4523 | 106818 | 94.46 | 0.22 | 0 |
| MvcJson | INTEL | Windows | EHWT | 508,750 | 92 | 401 | 3.72 | 641 | 4525 | 106818 | 93.77 | 0.21 | 0 |
| FortunesPlatform | INTEL | Windows | Baseline | 310,423 | 82 | 433 | 1.4 | 318 | 4037 | 87993 | 513.02 | 0.38 | 0 |
| FortunesPlatform | INTEL | Windows | EHWT | 309,203 | 78 | 430 | 1.44 | 310 | 4036 | 87993 | 505.71 | 0.4 | 0 |
| DbFortunesRaw | INTEL | Windows | Baseline | 278,565 | 83 | 434 | 1.58 | 548 | 4531 | 106818 | 462.88 | 0.48 | 0 |
| DbFortunesRaw | INTEL | Windows | EHWT | 276,785 | 85 | 435 | 1.58 | 541 | 4539 | 106818 | 463.26 | 0.44 | 0 |
| MvcDbFortunesEf | INTEL | Windows | Baseline | 121,677 | 94 | 476 | 5.55 | 709 | 4532 | 106818 | 1097.87 | 0.56 | 0 |
| MvcDbFortunesEf | INTEL | Windows | EHWT | 123,023 | 96 | 473 | 4.09 | 686 | 4530 | 106818 | 1100.22 | 0.54 | 0 |
| PlaintextPlatform | INTEL | Linux | Baseline | 9,008,473 | 99 | 77 | 0.49 | 195 | 4752 | 102122 | 30.89 | 0.11 | 0 |
| PlaintextPlatform | INTEL | Linux | EHWT | 9,060,435 | 98 | 76 | 0.48 | 195 | 4002 | 102122 | 29.7 | 0.11 | 0 |
| Plaintext | INTEL | Linux | Baseline | 4,180,523 | 99 | 89 | 0.9 | 340 | 7502 | 120702 | 36.39 | 0.11 | 0 |
| Plaintext | INTEL | Linux | EHWT | 4,181,031 | 99 | 89 | 1.06 | 337 | 4001 | 120702 | 36.27 | 0.11 | 0 |
| MvcPlaintext | INTEL | Linux | Baseline | 1,617,319 | 98 | 435 | 1.65 | 384 | 4001 | 120702 | 68.48 | 0.12 | 0 |
| MvcPlaintext | INTEL | Linux | EHWT | 1,588,558 | 98 | 436 | 1.78 | 381 | 4001 | 120702 | 68.36 | 0.13 | 0 |
| Json | INTEL | Linux | Baseline | 794,491 | 99 | 419 | 1.2 | 344 | 4001 | 120702 | 48.11 | 0.11 | 0 |
| Json | INTEL | Linux | EHWT | 797,031 | 99 | 417 | 1.11 | 334 | 4001 | 120702 | 48.19 | 0.12 | 0 |
| MvcJson | INTEL | Linux | Baseline | 420,179 | 98 | 432 | 1.02 | 381 | 4001 | 120702 | 77.18 | 0.19 | 0 |
| MvcJson | INTEL | Linux | EHWT | 421,569 | 98 | 432 | 1.03 | 388 | 4001 | 120702 | 77.56 | 0.2 | 0 |
| FortunesPlatform | INTEL | Linux | Baseline | 303,162 | 98 | 472 | 1.34 | 199 | 4001 | 102124 | 406.91 | 0.34 | 0 |
| FortunesPlatform | INTEL | Linux | EHWT | 301,909 | 98 | 462 | 1.4 | 201 | 3751 | 102124 | 403.55 | 0.32 | 0 |
| DbFortunesRaw | INTEL | Linux | Baseline | 255,194 | 98 | 486 | 1.46 | 340 | 4001 | 120702 | 376.94 | 0.33 | 0 |
| DbFortunesRaw | INTEL | Linux | EHWT | 254,049 | 98 | 489 | 1.52 | 341 | 4001 | 120702 | 379.51 | 0.39 | 0 |
| MvcDbFortunesEf | INTEL | Linux | Baseline | 101,943 | 96 | 488 | 2.77 | 436 | 4001 | 120702 | 904.9 | 0.64 | 0 |
| MvcDbFortunesEf | INTEL | Linux | EHWT | 103,843 | 97 | 492 | 2.8 | 427 | 4001 | 120702 | 928.15 | 0.6 | 0 |

@AndyAyersMS
Member Author

How much variability do you usually see in results? I'd be surprised if enabling this made things slower, but I see cases above where it looks like we lose 2-3% on RPS. If that's accurate, we should try and look at codegen for the key methods more closely.

@sebastienros
Member

Each number is the average of two runs. I will redo the ones that regressed just to be sure. It might happen that a run is bad.

@CarolEidt
Contributor

There are definitely places where EH Write through could be slightly worse. In situations where there are multiple definitions in the Try clause, each of those will do a store, and may also define a register - requiring an additional mov if the value could have been directly defined to memory. It is most effective when there are more uses than defs, and when register pressure is not excessive.
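The trade-off described here can be illustrated with a very rough cost model. Everything in it is an assumption made for illustration: the unit costs, the names `OpCosts`, `stackOnlyCost`, and `writeThruWorstCaseCost`, and the simplification that a write-through local always keeps its register between uses (so register pressure is ignored entirely).

```cpp
#include <cassert>

// Hypothetical unit costs; real instruction costs depend on the target and context.
struct OpCosts {
    int store    = 1; // store to the local's stack home
    int load     = 1; // load from the local's stack home
    int extraMov = 1; // extra register move when a def could have gone straight to memory
};

// Local kept only on the stack: every def is a store and every use is a load.
int stackOnlyCost(int defs, int uses, OpCosts c = {}) {
    return defs * c.store + uses * c.load;
}

// Write-through local, worst case for defs: every def pays a store plus an extra
// move, and uses are assumed to read the register for free.
int writeThruWorstCaseCost(int defs, int uses, OpCosts c = {}) {
    (void)uses; // uses are free in this simplified model
    return defs * (c.store + c.extraMov);
}
```

With the default unit costs, write-through is cheaper in this model exactly when uses outnumber defs: one def and five uses gives 2 versus 6, while four defs and one use gives 8 versus 5.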

@sebastienros
Member

I ran PlaintextPlatform 5 times for each setting. The minimum RPS without EHWT is greater than the maximum RPS with it, so it's definitely slower with EHWT enabled in this scenario:

COMPlus_EnableEHWriteThru=0

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | Build Time (ms) | Published Size (KB) | First Request (ms) | Latency (ms) | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | 8,001,783 | 84 | 58 | 0.381 | 287.0385 | 5572.3578 | 87991 | 35.5769 | 0.0977 | 0 |
| | 8,039,917 | 89 | 58 | 0.417 | 281.8752 | 4027.2104 | 87991 | 36.4238 | 0.1013 | 0 |
| | 8,101,041 | 84 | 59 | 0.40608 | 296.4301 | 4028.946 | 87991 | 35.6608 | 0.0958 | 0 |
| | 7,960,022 | 85 | 57 | 0.41706 | 286.7777 | 4043.8363 | 87991 | 36.1157 | 0.1007 | 0 |
| | 7,894,460 | 79 | 58 | 0.38552 | 281.9774 | 4053.8737 | 87991 | 35.8702 | 0.0974 | 0 |

COMPlus_EnableEHWriteThru=1

| Description | RPS | CPU (%) | Memory (MB) | Avg. Latency (ms) | Startup (ms) | Build Time (ms) | Published Size (KB) | First Request (ms) | Latency (ms) | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | 7,877,458 | 82 | 59 | 0.383 | 288.2986 | 4028.7659 | 87991 | 36.1246 | 0.1018 | 0 |
| | 7,891,998 | 81 | 58 | 0.39981 | 297.0558 | 4029.3682 | 87991 | 35.6291 | 0.0972 | 0 |
| | 7,815,282 | 78 | 59 | 0.39094 | 286.5538 | 4045.0774 | 87991 | 36.0377 | 0.1061 | 0 |
| | 7,842,825 | 78 | 58 | 0.42286 | 314.2182 | 4536.1388 | 87991 | 35.9681 | 0.0999 | 0 |
| | 7,804,675 | 90 | 58 | 0.428 | 286.0867 | 4021.7947 | 87991 | 35.5549 | 0.0971 | 0 |
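The min/max comparison can be checked directly from the RPS columns of the two five-run sets above. The sketch below only re-does that arithmetic; the helper names (`summarize`, `rangesDoNotOverlap`, `meanDropPercent`) are made up for this example, and five runs per setting is too few to say much beyond the range comparison itself.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

struct Summary {
    long long min;
    long long max;
    double    mean;
};

// Computes min, max, and arithmetic mean of a non-empty sample.
Summary summarize(const std::vector<long long>& v) {
    auto   mm   = std::minmax_element(v.begin(), v.end());
    double mean = std::accumulate(v.begin(), v.end(), 0.0) / static_cast<double>(v.size());
    return {*mm.first, *mm.second, mean};
}

// RPS values copied from the five runs with COMPlus_EnableEHWriteThru=0 and =1.
const std::vector<long long> kRpsOff = {8001783, 8039917, 8101041, 7960022, 7894460};
const std::vector<long long> kRpsOn  = {7877458, 7891998, 7815282, 7842825, 7804675};

// True when every run with the feature off beat every run with it on.
bool rangesDoNotOverlap(const std::vector<long long>& off, const std::vector<long long>& on) {
    return summarize(off).min > summarize(on).max;
}

// Relative drop in mean RPS when the feature is on, in percent.
double meanDropPercent(const std::vector<long long>& off, const std::vector<long long>& on) {
    double offMean = summarize(off).mean;
    return 100.0 * (offMean - summarize(on).mean) / offMean;
}
```

On these numbers the slowest run without EHWT (7,894,460) is still ahead of the fastest run with it (7,891,998), and the mean RPS drop is roughly 1.9%.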

@AndyAyersMS
Member Author

> It is most effective when there are more uses than defs, and when register pressure is not excessive.

I'm guessing you've looked into this, but couldn't we screen candidates and only enable write through for locals that have favorable def/use ratios?

@CarolEidt
Contributor

> couldn't we screen candidates and only enable write through for locals that have favorable def/use ratios?

I didn't pursue that; we don't readily have that information until we build Intervals, and at that point it is difficult for the register allocator to change its mind about making something a candidate. That said, it could presumably decide that some of the candidates should never get a register, effectively making it the same as if it were not a candidate, though that would require some tweaking to avoid actually allocating a register when not needed (it does better with RegOptional uses than defs).

It would require retaining additional information on each Interval which would in turn probably require some additional tuning to ensure that it doesn't degrade throughput.

@AndyAyersMS
Member Author

I wonder if we could gather this info during one of our ref count traversals. We currently don't distinguish reads and writes, but it seems like we easily could.
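As a rough illustration of separating reads from writes during a counting pass, here is a toy sketch. It does not reflect the JIT's actual ref-counting code: `ToyAccess`, `DefUseCounts`, `countDefsAndUses`, and the `uses > defs` screen are all hypothetical, and a real heuristic would presumably weight counts by block weight rather than use raw counts.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Toy model of a local variable reference seen during a traversal.
enum class AccessKind { Def, Use };

struct ToyAccess {
    int        lclNum; // local variable number, e.g. 8 for V08
    AccessKind kind;
};

struct DefUseCounts {
    int defs = 0;
    int uses = 0;
};

// Single pass over all references, keeping separate def and use counts per local.
std::map<int, DefUseCounts> countDefsAndUses(const std::vector<ToyAccess>& accesses) {
    std::map<int, DefUseCounts> counts;
    for (const ToyAccess& a : accesses) {
        DefUseCounts& c = counts[a.lclNum];
        if (a.kind == AccessKind::Def) {
            c.defs++;
        } else {
            c.uses++;
        }
    }
    return counts;
}

// Hypothetical screen: treat a local as a write-through candidate only when
// its uses outnumber its defs.
bool isFavorableForWriteThru(const DefUseCounts& c) {
    return c.uses > c.defs;
}
```

A local with one def and three uses passes this screen, while one with two defs and a single use does not.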

Probably best to first dig into the code that's causing slowdowns in the tests above to ensure this is why things end up slower.

@AndyAyersMS
Member Author

This issue is nominally done. Would like to close and capture what remains in a new issue. @CarolEidt is there an issue for enabling EH Write Thru by default? I couldn't find one.

Feels like there is still a moderate amount of work to do before we can turn EH Write Thru on by default, and some risk that we're just not going to see the benefits here we'd hoped to see. I'd like to do a more careful evaluation and try and find or create examples where we clearly do expect to see benefits.

Seems like all this might be a stretch for 5.0 but possibly worthwhile; trying to assess if we should find a way to do that sometime soon, or defer all of this until after 5.0.

@CarolEidt
Contributor

Thanks for all your analysis, Andy. I agree that this issue, the investigation, is complete.
There's no issue for enabling EH write thru by default - I'll add one.

@AndyAyersMS
Member Author

Thanks, I will close this issue.

@sebastienros thanks for running all those tests for us -- we'll need to dig in deeper to understand what's going on. Hope we can get to it before too long.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020