JIT: modify box/unbox/isinst/castclass expansions for fast jitting #13188
When the jit is generating code in debug/minopts/rare-block/Tier0 modes, expand box via a call to the box helper rather than inline.
Generally speaking, smaller intermediate and final code should correlate with faster jitting.
This is preliminary work to experiment with different box expansion strategies for fast jitting.
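For illustration, here's a minimal C# sketch of the kind of code involved (the struct and method names are hypothetical, and the expansion details in the comment are simplified):

```csharp
struct BigStruct
{
    public long A, B, C, D; // large enough that inline box expansion emits a lot of code
}

static class BoxExample
{
    static object BoxIt(BigStruct s)
    {
        // The IL for this return contains a `box BigStruct` instruction. Expanded
        // inline, the jit emits an allocation plus a multi-field copy; expanded as
        // a helper call, it emits a single call, which is much less IR and native code.
        return s;
    }
}
```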
Regular jit-diffs shows some small improvement from the rare block logic:
This also helps clean up some of the craziness seen in #13187, as boxes of large structs on cold paths are now kept compact (while still doing far more work than needed just to feed GetType).
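As a sketch of the pattern from #13187 (hypothetical generic helper): for a value-type T, the box exists only so GetType can be called on it.

```csharp
using System;

static class GetTypeExample
{
    static Type TypeOf<T>(T value)
    {
        // For a value-type T, the C# compiler emits `box !!T` followed by a call
        // to GetType; the box exists only to feed GetType. With the helper-call
        // expansion, a cold-path box like this stays compact in code size, even
        // though the box itself is still unnecessary work.
        return value.GetType();
    }
}
```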
Forcing minopts on and running jit-diffs shows larger wins, so I'm hoping these show up in the new minopts TP runs.
Still some losses to look at. Small structs may benefit from inline expansion, and the best expansion depends both on how the box value is produced and on how the box is consumed.
The TP job data was inconclusive; not surprising given the relatively small expected impact.
To get a more detailed look, I used a variant of the instructions retired explorer to mine ETL on Windows. Looking just at corelib:
So there is no TP impact in regular opts, and a small win in minopts.
FWIW raw data looks like this:
With full opt, jit time almost doubles over minopts:
Added similar checks to the isinst/castclass helpers. These kick in a fair amount. Updated stats:
So on corelib, about a 1% TP improvement for minopts, with 1.9% smaller code.
Jit-diffs under-reports impact on other assemblies because R2R overrides all this logic already. Need to look at fragile prejitting for a broader take, or else assess on desktop.
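For reference, a couple of hypothetical C# methods showing the constructs that compile down to these instructions:

```csharp
static class CastExamples
{
    static string AsString(object o)
    {
        return o as string; // compiles to an `isinst string` instruction
    }

    static string CastString(object o)
    {
        return (string)o;   // compiles to a `castclass string` instruction
    }
}
```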
We can extend this approach to UNBOX and possibly more. Still investigating.
LGTM - and I really appreciate the restructuring and helpful comments and dumps.
Added similar logic to unbox. This kicks in more broadly in jit-diffs since there is no R2R special casing. Net effect of all 3 on jit-diffs (with MINOPTS) is now:
and the updated corelib TP data (reordered a bit from above) is:
There is now a slight regression in non-MINOPTS tp, so it may be that simple unboxes (primitives and structs with 1 field, say) should always be expanded inline as we are doing for simple boxes.
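A sketch of what a "simple unbox" means here (hypothetical method, IL details simplified):

```csharp
static class UnboxExample
{
    static int UnboxInt(object o)
    {
        // `(int)o` compiles to `unbox.any int32`: a type check plus a small copy.
        // For primitives the inline expansion is already tiny, so routing it
        // through a helper call may not pay off, matching the regression above.
        return (int)o;
    }
}
```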
TP job shows about a 1% net improvement for minopts.
Also a regression for non-minopts. The regression is a bit harder to understand, as the only non-minopts impact should come from rarely run blocks. Looks like most of the regression cases are in small assemblies, so this could be partly an artifact of how the overall aggregate score is computed. Will pick one or two and drill in further.
I have a hypothesis that (for minopts anyway, and perhaps even for full opt) the time spent jitting is proportional to the size of the jit output. Let's see how well that holds up here.
Looking just at S.P.Corelib where we have detailed data, there is a net 2.45% decrease in native code size. Jit time is around 60% of crossgen time. So if jit time is directly proportional to native code size, we'd expect to see around a (2.45 * .6) = 1.5% improvement in overall throughput.
Inst retired data shows a 1.4% improvement. So the hypothesis that native code size ~ jit time seems to hold up pretty well here.
Code size can thus be used as a proxy for evaluating other similar approaches to decreasing minopts time -- or more generally, for trying to get the jit to generate code as fast as possible, e.g. for debuggable codegen or a new Tier0 jitting mode.
It would be interesting to see what the relationship is between the size of the IR at various points in the phase order and the size of the JIT output. I would guess that there is a pretty direct relationship there as well.
From the minopts TP job, geomean diff/base time ratio is 0.985, net diff/base ratio is 0.988. No obvious size-related bias in the results, though it appears smaller runs are noisier (not surprising).
Each data point here is the average of 5 runs. But for this kind of thing averaging can be misleading since we don't have any expectation as to the underlying "noise" distribution.
For example, the worst data point below comes from crossgenning System.Memory, which is reported as being 1.6x slower. The per-iteration times are:
So Run5 in the diff trashes the overall result. If we'd summarized with the median instead, we'd have reported a 1.1x slowdown, which seems more representative.
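As a sketch, summarizing with the median instead of the mean would look like this (hypothetical helper, not part of the actual measurement scripts):

```csharp
using System.Linq;

static class RunSummary
{
    // The median is robust to a single outlier iteration (like the slow Run5
    // above), whereas the mean lets one bad run dominate the reported ratio.
    public static double Median(double[] runs)
    {
        double[] sorted = runs.OrderBy(t => t).ToArray();
        int n = sorted.Length;
        return n % 2 == 1
            ? sorted[n / 2]
            : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }
}
```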
changed the title from "[WIP] Experiment with box expansion strategy for fast jitting" to "JIT: modify box/unbox/isinst/castclass expansions for fast jitting" on Aug 8, 2017
Desktop tests passed, but after looking over diffs I'm going to back away from using the box helper for rarely run blocks in full opt. It seems tricky to get the logic right here; using the helper call can break the box(value) == null optimization and may also push zero inits into the prolog.
Since the value of this in full opt is less clear, it seems prudent to just defer for now.
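To make the first issue concrete, here's a hypothetical generic null check of the kind the helper call can pessimize:

```csharp
static class NullCheckExample
{
    static bool IsNull<T>(T value)
    {
        // When T is a value type, `(object)value` is a box, and the jit can fold
        // `box(value) == null` to false, but only when it can see the box
        // expansion. An opaque helper call hides it, blocking the fold.
        return (object)value == null;
    }
}
```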
isinst/castclass/unbox don't seem to have such issues, but will revisit just to be sure, once the diffs from unbox are out of the picture.