Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid codegen where DistIL emits unexpected default(ref sbyte) #35

Closed
neon-sunset opened this issue Apr 25, 2024 · 6 comments
Closed

Comments

@neon-sunset
Copy link

Given original code (ILSpy view)

[MethodImpl(MethodImplOptions.AggressiveOptimization)]
internal static nuint CountRunes(ref byte src, nuint length)
{
	nuint num = 0u;
	ref sbyte reference = ref Unsafe.As<byte, sbyte>(ref src);
	ref sbyte reference2 = ref UnsafeExtensions.Add(ref reference, length);
	if (Vector256.IsHardwareAccelerated && length >= (nuint)Vector512<byte>.Count)
	{
		ref sbyte right = ref UnsafeExtensions.Substract(ref reference2, Vector512<byte>.Count);
		// rest omitted
}

when applying DistIL optimization, it is rewritten to the following, with an unexpected default(ref sbyte):

[MethodImpl(MethodImplOptions.AggressiveOptimization)]
internal static nuint CountRunes(ref byte src, nuint length)
{
	nuint num = 0u;
	ref sbyte reference = ref Unsafe.Add(ref Unsafe.As<byte, sbyte>(ref src), length);
	ref sbyte reference2 = default(ref sbyte); // <-- unexpected
	if (Vector256.IsHardwareAccelerated && length >= (nuint)Vector512<byte>.Count)
	{
	        // Immediate NRE here
	        Vector512<byte> vector = Vector512.LessThan(Vector512.LoadUnsafe(ref reference2), right2).AsByte();
		// rest omitted
}

Steps to reproduce:

  1. clone https://github.com/U8String/U8String
  2. cd U8String/Sources/U8String
  3. dotnet add package DistIL.OptimizerTask --prerelease
  4. Add <DistilAllMethods>true<DistilAllMethods> to .csproj
  5. cd ../..
  6. dotnet test -c Release

The miscompiled method in question is https://github.com/U8String/U8String/blob/main/Sources/U8String/Shared/U8Searching.cs#L615

Let me know if you need any other details. Thanks!

@dubiousconst282
Copy link
Owner

Thanks. I've tested using the main branch and everything is passing green (to my surprise). The NuGet package is quite outdated and has a bug in the register allocator, which I believe is the cause of this issue (a regression fixed in 0a8dce7).

I'll update the package shortly!

(Also, I'm curious to see benchmarks, but these will take a while to run on my machine. I'm a bit skeptical that DistIL will do much to help here though lol. On another note, I think it would be interesting to fold the Unsafe intrinsics into IL, although that's probably just a cosmetic size opt.)

@neon-sunset
Copy link
Author

neon-sunset commented Apr 26, 2024

To be honest, I mostly tried it out of curiosity. There is, however, a legitimate use case for DistIL - U8String pushes JIT quite a bit and is the kind of project that runs into its limits way more often than is desirable.

Specifically, inlining size budget, inlining depth and locals limit are being hit either separately or all at once within all kinds of demo code. It took me about three attempts to get SplitFirst/Last more or less right so that seemingly simple looking code is not some ungodly 650 locals behind the scenes and JIT giving up halfway through causing terrible spills and whatnot.

Moreover, while codegen quality has massively improved in the span of .NET Core 3.1 and 8, it's still far from perfect and tends to be brittle especially in light of things like enregistered local -> spill and pass byref -> enregister again. It would remove that post-inlining but there is still impact of the way optimization phases interact so it's not perfect and ends up either using more registers than necessary, interact poorly with "register homing" and overall means more work for the JIT, which is bad bad.

It is the reason that U8String still uses rudimentary interceptor generator to replace certain precondition checks, even if JIT is perfectly capable to optimize them away itself, the end result still ends up making significant difference around inlining and quality.

Which is why DistIL is interesting as it makes certain abstractions truly zero-cost. However, the first impression is that the way it inlines sometimes interacts poorly with the way JIT does so.

JIT inlines methods top-down, so that cheap single-statement methods like operators or property accessors forwarding the call, or cheap dispatch methods which are written in a way for the JIT to see constant folding opportunities and boost their inline profitability. are "peeled away" as it keeps inlining methods into the caller. But because DistIL might have inlined a small body into the operator, it ends up being an operator call that won't ever inline, even if it absolutely must occur (I haven't examined inlining log output yet but it's not like we can do something about it until it gets better).

Therefore, DistilAllMethods is definitely an option too nuclear for U8String but I'm keeping my eye on the project as it could meaningfully help at certain spots where other options involve making project complexity worse (it's not complex per se - all logic is very simple but because it's a shitton of method overloads, it's hard to mentally keep track of everything).

@dubiousconst282
Copy link
Owner

The JIT limitation imposed around the number of locals is quite an annoyance and I have hit it a couple times in other projects. I don't know exactly how RyuJIT creates/counts as locals, but since DistIL effectively never touches structs, I imagine it would make sense to try decompose them into locals so that the register allocator can hopefully kick in later and help decrease the amount, like the object SROA does (although that quickly becomes complicated for structs that are ever passed to other methods or taken by ref again).

As for the inliner, I've added some rudimentary heuristics earlier this year, whereas the previous version would have simply inlined anything within 32 bytes of IL, the new heuristic is based on a budget system and will take into account caller and callee sizes to try and avoid growing the body by more than twice its original size. I have not yet analysed how it interacts with the JIT, but observed that it does help stop excessive growth in IL size, which I think is most desirable as you know JIT has some scary hard limitations around sizes. It's a tricky balance though, as optimization passes typically rely on localized info to work.

(On a more personal note, for whatever it is worth, is that the sheer amount of effort around things left to do, test, and fix, plus design flaws like #33 have kind of drained my motivation. This is not to say that I'm abandoning DistIL now, I do believe that it does have some potential, but I think that realistically RyuJIT might as well catch up to whatever it can do quicker than what I can keep up with. Although they do have different priorities and goals of course, and the sort of higher lever optimizations probably won't ever be a thing.)

@neon-sunset
Copy link
Author

neon-sunset commented Apr 27, 2024

Yeah, I can totally relate to "hitting a particularly difficult part and losing the energy from easy dopamine available earlier".

FWIW, DistIL is the only project that properly solves "LINQ problem" that C# has where I personally believe Roslyn team holds an opinion so wrong it's infuriating (because the effort that DistIL applies to the problem is something that should be a part of standard library at this point given C#'s performance aspirations).

Personally, were DistIL reach the production-ready state only in its LINQ rewriting (especially around lambdas) and nothing else, I would immediately adopt it in every project that makes use of LINQ (provided it works sufficiently well with DistilAllMethods).

Because when you look at list.First(item => item == other) and it is lowered to a very tight loop with perfect codegen, it's just so disappointing to go back to regular code knowing it's on the opposite end of performance spectrum to that.

Perhaps it is worth to polish just the LINQ part, release that and then promote the project to get aid from other contributors too? (by no means I want to impose on you, and I'm also guilty of being busy with U8String given that the correct response for feature requests in OSS projects is often "contribute it yourself", apologies if this comes off wrong)

The JIT limitation imposed around the number of locals is quite an annoyance and I have hit it a couple times in other projects. I don't know exactly how RyuJIT creates/counts as locals, but since DistIL effectively never touches structs, I imagine it would make sense to try decompose them into locals so that the register allocator can hopefully kick in later and help decrease the amount, like the object SROA does (although that quickly becomes complicated for structs that are ever passed to other methods or taken by ref again).

Yup. Manual decomposition should definitely help with that - it's the reason why so many of the methods on u8string do var (value, offset, length) = this - .NET 8+ JIT handles struct decomposition (physical promotion, scalar replacement, call it however you want) so much better, to the point where even "duplicate" values in the form of dereferenced this and its constituent parts are de-duplicated, but this does consume its resources and isn't completely free.

It's a tricky balance though, as optimization passes typically rely on localized info to work.

Yeah, F# has also historically had issues around it (it does IL inlining too) because so much of the tuning in inlinepolicy.cpp is reliant on a particular way IL emitted by Roslyn looks like, so, despite the flaws, chains of method call forwards are okay (up until depth limit ha), but when you inline them in IL it may make JIT think it's less profitable, causing unexpected deoptimization (which is why in U8String, Length and Offset reads are kept as property accessors and never manually inlined, because it makes codegen better and inliner work properly, god why...)

@dubiousconst282
Copy link
Owner

dubiousconst282 commented Apr 27, 2024

You make a good point. I guess I've been focusing at too many things at the same instead of getting it to do one thing well, and optimizing LINQ is evidently going the most impactful of those things.

The idea is hardly new of course, but as you said, DistIL is probably the most seamless implementation of that. (Actually, I believe it is in fact not much different from the old linq-roslyn-rewrite project, but AFAIR it did so by hooking the build system into a modded Roslyn. The existing pipeline works surprisingly well as far as I can tell, but I hope to eventually get a fully generic set of passes that devirtualizes/decomposes everything without being hardcoded to specific cases like it is currently, which I did made some progress towards, and apart from some tricky cases, seems to be far from impossible.)

I think the main thing left to do for a mostly production-ready state is writing/integrating some tests from other projects, just to be sure it won't break anything spectacularly. I'm reasonably confident on most other passes and the frontend/backend, but having some fuzzing done eventually wouldn't hurt (maybe via Fuzzlyn).

And I have a feeling that getting contributions is very hard even on larger projects (evidently more so around .NET projects?), so I'll not keep my hopes up in the mean time.

@dubiousconst282
Copy link
Owner

I'll close this since the original issue has been solved, but I have taken some notes and hope to start working on them soon. Feel free to submit new issues or message me for anything more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants