For context, here's the before/after perf hit on Array.Sort():
Very specifically, I was hit with a 20% drop in Array.Sort(), and a smaller hit (by sheer luck, I presume?) with my own code. Intel claims that the average drop in performance should be in the 0%-4% range.
The erratum claims that the microcode update adversely affects:
In this context, Jump Instructions include all jump types: conditional jump (Jcc), macro-fused op-Jcc (where op is one of cmp, test, add, sub, and, inc, or dec), direct unconditional jump, indirect jump, direct/indirect call, and return.
The recommended fix, per the erratum (Section 3.1.1), is to recompile the code with: -mbranches-within-32B-boundaries
In theory, this could end up as a relatively minor(?) CoreCLR runtime release, or be incorporated into the 3.1 release cycle?
In addition, I think we should have a discussion about possibly applying similar (conditional) fixes to the JIT itself w.r.t. the code generated on Intel processors, as it doesn't seem this issue is about to simply go away on its own.
I will open a separate issue for the JIT related discussion.
Separately, since you used Array.Sort as an example: it's no longer implemented in C++; as of last week the implementation is all C#. I'm curious whether that affects your stated impact at all, or whether you see a similar effect on Array.Sort with the latest master?
@stephentoub That's a very good question.
I haven't tested with master yet, and I am aware of the C++ -> C# transition for that specific piece of code.
My code itself (in this case the Scalar + Unmanaged variations of IntroSort) is definitely affected (as seen in the same screenshot), with very wild outcomes:
Scalar, for example, suffers a 13% perf drop.
Unmanaged is hit with a very minor 1.4% drop.
Given that it was pretty much a copy-paste of IntroSort from CoreCLR to begin with, I think this gives you a "preview" of where this might go.
I assume that the effects are random in nature, depending on code generation/allocation address randomness.
But in general, the branchier your code is, the harder you'll probably get hit, which can also be seen in my two examples: these are practically the same functions, except that one of them has extra bounds-checking that is not normally eliminated (nor did I attempt to optimize it away) and the other doesn't...
My other benchmarks actually end up doing substantially fewer branches (as in orders of magnitude fewer), so I suspect I won't see a lot of movement on that front...
Just wanted to circle back and say that I managed to revert the microcode update in a very localized way on Clear Linux (or, I assume, any Linux):
1. Change the kernel boot parameters to avoid early loading of microcode updates (remove initrd=/EFI/org.clearlinux/freestanding-00-intel-ucode.cpio from the boot cmdline)
2. Revert to an older microcode package from Intel to make sure that late loading of microcode does not sneak the update back in, by updating /lib/firmware/intel-ucode/ to the contents of the 2019-09-18 ucode release
After doing those two steps and (naturally) rebooting, I'm back to the older microcode (0xb4) for my Kaby Lake processor:
$ grep 'stepping\|model\|microcode' /proc/cpuinfo | head -4
model : 158
model name : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0xb4
Rerunning the same benchmarks now yields the good old results I've been seeing in the past months:
I hope to have all of the data in a format for sharing shortly. The short answer is that I don't think there is any immediate work that needs to be done here, but I will provide more context on this once I have all of the data compiled.