State / Direction of C# as a High-Performance Language #10378

Open
ilexp opened this Issue Apr 6, 2016 · 163 comments

Projects

None yet
@ilexp
ilexp commented Apr 6, 2016 edited

I've been following recent development of C# as a language and it seems that there is a strong focus on providing the means to write code more efficiently. This is definitely neat. But what about providing ways to write more efficient code?

For context, I'm using C# mostly for game development (as in "lowlevel / from scratch") which has a habit of gladly abandoning the usual ways of safe code design for that 0.1% of the bottleneck code in favor of maximum efficiency. Unfortunately, there are cases where C# gets in the way of that last bit of optimization.

Issues related to this:

Other sentiments regarding this:

  • The only way to improve / guarantee memory locality right now seems to be putting the data into an array of structs. There are scenarios where this is a bit impractical. Are there ways to handle this with classes?
  • Language support for object pooling / limited control over what exactly "new" does for a given class / reference type.
  • A way to instantiate a reference type "within scope", making cleaning it up more efficient by somehow providing the GC with the extra knowledge about an explicit / intended lifespan.
  • Please share your own in a comment

This is probably more of a broader discussion, but I guess my core question is: Is there a general roadmap regarding potential improvements for performance-focused code in C#?

@JoshVarty
Member

A way to instantiate a reference type "within scope", making cleaning it up more efficient by somehow providing the GC with the extra knowledge about an explicit / intended lifespan.

I believe this was also requested in #161

@HaloFour
HaloFour commented Apr 7, 2016
  • The only way to improve / guarantee memory locality right now seems to be putting the data into an array of structs. There are scenarios where this is a bit impractical. Are there ways to handle this with classes?
  • A way to instantiate a reference type "within scope", making cleaning it up more efficient by somehow providing the GC with the extra knowledge about an explicit / intended lifespan.

I don't believe that these issues can be solved without direct CLR support. The CLR limits reference types to the heap. Even C++/CLI is forced to abide by that restriction and the stack semantics syntax still allocates on the heap. The GC also provides no facility to directly target specific instances.

I wonder how much C# could make a struct feel like a class before it crosses into unsafe/unverifiable territory. C++/CLI "native" classes are CLR structs so you don't have to deal with allocation/GC but of course the IL it emits is quite nasty.

@ilexp
ilexp commented Apr 8, 2016

I've added some more related issues to the above list, which hadn't been mentioned yet.

@amulware
amulware commented Apr 8, 2016

I am in a very similar position to @ilexp, and generally interested in the performance of my code, and knowing how to write efficient code. So I'd second the importance of this discussion.

I also think the summary and points in the original post are quite good, and have nothing to add at the moment.

Small note on using structs sort of like classes (but keeping everything on the stack):
I believe we can 'just' pass our structures down as ref for this purpose?
Make sure you don't do anything that it that creates a copy, and it should look like a class...
Not sure if that work flow needs any additional support from the language.

About memory locality: I was under the impression that if I new two class-objects after each other, they will also be directly after each other in memory, and stay that way? May be an implementation detail, but it's better than nothing... That being said, I've had to move from lists of objects to arrays of structs for performance reasons as well (good example would be particle systems, or similar simulations that have many small and short lived objects). Just the overhead from resolving references and having to gc the objects eventually made my original solution unfeasible. I am not sure this can be 'fixed' in a managed language at all though...

Looking forward to seeing what others have to say on this topic!

@mattwarren

The only way to improve / guarantee memory locality right now seems to be putting the data into an array of structs. There are scenarios where this is a bit impractical. Are there ways to handle this with classes?

There was a really nice prototype done by @xoofx showing the perf improvements of allowing stackalloc on reference types.

@SunnyWar

The only way to improve / guarantee memory locality right now seems to be putting the data into an array of structs. There are scenarios where this is a bit impractical. Are there ways to handle this with classes?

Microsoft Research many years ago experimented with using some unused bits on each object as access counters. The research hacked the heap to re-organized mostly used objects so that they ended up on the same page. He showed in a sample XML parser that C# code was faster than optimized C++. The talk he gave on it was called, "Making C# faster than C#". The researcher that developed the technique left MS and the research apparently died with him. He had a long list of other, similar improvements that he was planning on trying. None of which, I believe, saw daylight.

Perhaps this work should be resuscitated so that the promise (made in the beginning: remember how the JITer was going to ultra-optimize for your hardware??) can be realized.

@Claytonious

We are in the crowded boat of using c# with Unity3d, which may finally be moving toward a newer CLR sometime soon, so this discussion is of great interest to us. Thanks for starting it.

The request to have at least some hinting to the GC, even if not direct control, is at the top of our iist. As the programmer, we are in a position to declaratively "help" the GC but have no opportunity to do so.

@IanKemp
IanKemp commented Apr 14, 2016

"game development... has a habit of gladly abandoning the usual ways of safe code design for that 0.1% of the bottleneck code in favor of maximum efficiency. Unfortunately, there are cases where C# gets in the way of that last bit of optimization."

C# gets in the way because that's what it was designed to do.

If you want to write code that disregards correctness in favour of performance, you should be writing that code in a language that doesn't enforce correctness (C/C++), not trying to make a correctness-enforcing language less so. Especially since scenarios where performance is preferable to correctness, is an extremely tiny minority of C# use cases.

@orthoxerox

@IanKemp that's a very narrow view of C#. There are languages like Rust that try to maximize correctness without run-time overhead, so it's not one vs the other. While C# is a garbage-collected language by design, with all the benefits and penalties that it brings, there's no reason why we cannot ask for performance-oriented improvements, like cache-friendly allocations of collections of reference types or deterministic deallocation, for example. Even LOB applications have performance bottlenecks, not just computer games or science-related scripts.

@svick
Contributor
svick commented Apr 14, 2016

@IanKemp Are you saying that unsafe does not exist? C# had that from the start and it's exactly for that small amount of code where you'll willing to sacrifice safety for performance.

@SunnyWar

Hey, people...try this: write a function will result in no garbage collections....something with a bunch of math in it for example. Write the exact same code in C++. See which is faster. The C++ compiler will always generate as fast or faster code (usually faster). The Intel compiler is most often even faster...it has nothing to do with the language.

For example. I wrote a PCM audio mixer is C#, C++ and compile with the .Net, MS, and Intel compilers. The code in question had no GC, no boundary checks, no excuses.

C#: slowest
C++ Microsoft: fast
C++ Intel: super fast

In this example the Intel compiler recognized that computation could be replaced by SSE2 instructions. the Microsoft compiler wasn't so smart, but it was smarter than the .Net compiler/JITer.

So I keep hearing talk about adding extensions to the language to help the GC do things more efficiently, but it seems to me ...the language isn't the problem. Even if those suggestion are taken we're still hamstrung by an intentionally slow code generating compiler/jitter. It's the compiler and the GC that should be doing a better job.

See: #4331 I'm really tired of the C++ guys saying, "we don't use it because it's too slow" when there is _very little reason _for it to be slow.

BTW: I'm in the camp of people that doesn't care how long the JITer takes to do its job. Most of the world's code runs on servers...why isn't it optimized to do so?

@msedi
msedi commented Apr 14, 2016

I completely agree with all of the mentioned improvements. These are in my opinion absolutely mandatory. Using C# in high performance applications is the right was. It makes code much easier to read if there would be at least some of the suggested improvements. Currently we have to "leave" the language to C++ or C to create things there are not possible in C#, and i don't mean assembler instructions but very simple pointer operations on blittable Data types or generics.

So not to leave the language i created unreadable Code fragments just not to use unmanaged code because then i am dependent on x86 and x64.

@ilexp
ilexp commented Apr 15, 2016 edited

BTW: I'm in the camp of people that doesn't care how long the JITer takes to do its job. Most of the world's code runs on servers...why isn't it optimized to do so?

From a gamedev perspective, it would be neat if there was a way to tell the runtime to perform extended JIT optimization using framework API.

Let's say by default, there is only the regular, fast optimization, the application starts up quickly and all behaves as usual. Then I enter the loading screen, because I'll have to load levels and assets anyway - now would be an excellent time to tell the runtime to JIT optimize the heck out of everything, because the user is waiting anyway and expecting to do so. This could happen on a per-method, per-class or per-Assembly level. Maybe you don't need 90% of the code to be optimized that well, but that one method, class or Assembly should be.

As far as server applications go, they could very well do the same in the initialization phase. Same for audio, image and video processing software. Extended JIT optimization could be a very powerful opt-in and on runtimes that do not support this, the API commands can still just fall back to not having any effect.

Maybe it would even be possible to somehow cache the super-optimized machine code somewhere, so it doesn't need to be re-done at the next startup unless modified or copied to a different machine. Maybe partial caches would be possible, so even if not all code is super-JITed yet, at least the parts that are will be available. Which would be a lot more convenient and portable than pre-compiling an Assembly to native machine code, simply because Assemblies can run anywhere and native machine code can not.

All that said, I think both allowing the JIT to do a better job and allowing developers to write more efficient code in the first place would be equally welcome. I don't think this should be an either / or decision.

@xoofx
Member
xoofx commented Apr 15, 2016 edited

While advocating for many years about performance for C#, I completely concur with the fact that It would be great to see more investments in this area.

Most notably on the following 3 axes:

  1. Allow to switch-on a better code gen JIT (but slower). There is high hope that this will be fulfilled by the undergoing work on LLILC, for both JIT and AOT scenarios. Note that many platforms (e.g iOS, UWP/XboxOne, PS4) don't support JIT scenarios. But It will take time to achieve even performance parity with the current JIT and there are some language/runtime constraints that could make full optimization difficult (GC statepoints, array/null/arithmetic safe checks...etc.)
  2. Improve the language (with sometimes a proper JIT/GC support) that could help in this area. That include things listed above like ref locals, array slices, string slices... and even builtin utf8 strings... Some hacks can be done by post-processing IL and have been abused in many projects, but it would be great to have these little things available without making any IL voodoo.
  3. Put a lot more emphasis on memory management, data locality and GC pressure
    • Standard improvements like stack alloc for class, embeded class instance, borrowed pointers
    • Rethink our usage of the GC, while a bit more problematic, as I haven't seen much proven models in production (things like: explicit vs implicit management of GC regions to allocate known objects to a proper region of objects that would relate in terms of locality/longevity)

Unfortunately, there are also some breaking change scenarios that would require to fork the language/runtime to correctly address some of the intrinsic weakness of the current language/runtime model (e.g things that have been done for Midori for their Error Model or safe native code for example...etc.)

@svick
Contributor
svick commented Apr 15, 2016

@SunnyWar I think there's enough enough room to optimize both code generation for math and GC.

As to which one should have higher priority, keep in mind that it's relatively easy to workaround bad performance in math by PInvoking native code or using Vector<T>. Working around bad performance due to GC overhead tends to be much harder, I think.

And since you mention servers, a big part of their performance are things like "how long it take to allocate a buffer", not "how long does it take to execute math-heavy code".

@GSPP
GSPP commented Apr 15, 2016 edited

I'm adding JIT tiering to the list of features I see as required to make C# a truly high performance language. It is one of the highest impact changes that can be done at the CLR level.

JIT tiering has impact on the C# language design (counter-intuitively). A strong second tier JIT can optimize away abstractions. This can cause C# features to become truly cost free.

For example, if escape analysis and stack allocation of ref types was consistently working the C# language could take a more liberal stance on allocations.

If devirtualization was working better (right now: not all all in RyuJIT) abstractions such as Enumerable.* queries could become cost free (identical performance to manually written loops).

I imagine records and pattern matching a features that tend to cause more allocations and more runtime type tests. These are very amenable to advanced optimizations.

@OtherCrashOverride

Born out of a recent discussion with others, I think its time to review the "unsafe" syntax. The discussion can be summarized as "Does 'unsafe' even matter anymore?" .Net is moving "out of the security business" with CoreCLR. In a game development scenario, most of the work involves pointers to blocks of data. It would help if there was less syntactic verbosity in using pointers directly.

Support for SSE4 Intrinsics: CoreFx Issue 2209

This is completely useless on the billions of ARM devices out there in the world.

With regard to the GC discussion, I do not think that further GC abuse/workarounds are the solution. Instead there needs to be a deterministic alloc/ctor/dtor/free pattern. Typically this is done with reference counting. Today's systems are mutli-core, and today's programs are multi-threaded. "Stop the world" is a very expensive operation.

In conclusion, what is actually desired is the C# language and libraries but on top of a next-generation runtime better suited for the needs of "real-time" (deterministic) development such as games. That is currently beyond the scope of CoreCLR. However, with everything finally open source, its now possible to gather a like minded group to pursue research into it as a different project.

@TPoise
TPoise commented Apr 15, 2016

I'm doing a lot of high-perf / low latency work in C#. One thing that would be "the killer feature" for perf work is for them to get .NET Native fully working. I know it's close, but the recent community standups have said that it won't be part of v1.0 RTM and they're rethinking the usage for it. The VS C++ compiler is amazing at auto-vectorizing, dead code elimination, constant folding, etc. It just does this better than I can hand-optimize C# in its limited ways. I believe traditional JIT compiling (not just RyuJIT) just doesn't have enough time to do all of those optimizations at run-time. I would be in favor of giving up additional compile time, portability, and reflection in exchange for better runtime performance; and I suspect those that are contributing to this thread here probably feels the same way. For those that aren't, then you still have RyuJIT.

Second, if there were some tuning knobs available for the CLR itself.

@GSPP
GSPP commented Apr 15, 2016

Adding a proposal for Heap objects with custom allocator and explicit delete. That way latency-sensitive code can take control of allocation and deallocation while integrating nicely with an otherwise safe managed application.

It's basically a nicer and more practical new/delete.

@benaadams

@OtherCrashOverride @GSPP Destructible Types? #161

@OtherCrashOverride

Ideally, we want to get rid of IDisposable entirely and directly call the dtor (finalizer) when the object is no longer in use (garbage). Without this, the GC still has to stop all threads of execution to trace object use and the dtor is always called on a different thread of execution.

This implies we need to add reference counting and modify the compiler to increment and decrement the count as appropriate such as when a variable is copied or goes out of scope. You could then, for example, hint that you would like to allocate an object on the stack and then have it automatically 'boxed' (promoted) to a heap value if its reference count is greater than zero when it goes out of scope. This would eliminate "escape analysis" requirements.

Of course, all this is speculation at this point. But the theoretical benefits warrant research and exploration in a separate project. I suspect there is much more to gain from redesigning the runtime than there is from adding more rules and complication to the language.

@SunnyWar

@OtherCrashOverride I've also come to the conclusion that a reference counting solution is critical for solving a number of problems.

For example, some years ago I wrote message passing service using an Actor model. The problem I ran into right away is I was allocating millions of small objects (for messaging coming in) and the GC pressure to clean after they went out was horrid. I ended up wrapping them in a reference counting object to essentially cache them. It solved the problem BUT I was back to old, ugly, COM days of having to insure every Actor behaved and did an AddRef/Release for every message it processed. It worked..but it was ugly and I still dream of a day I can have a CLR managed reference countable object with an overloadable OnRelease, so that I can put it back in the queue when the count==0 rather than let it be GC'd.

@ilexp
ilexp commented Apr 15, 2016 edited

Don't want to detail the rest of it in this general overview thread, just regarding this specific point of @OtherCrashOverride's posting:

[...] than there is from adding more rules and complication to the language.

As a general direction of design with regard to future "efficient code" additions, I think it would be a good thing to keep most or even all of them - both language features and specialized API - hidden away just enough so nobody can stumble upon them accidentally, following the overall "pit of success" rule if you will.

I would very much like to avoid a situation where improving 0.1% of performance critical code would lead to an overall increase in complexity and confusion for the 99.9% of regular code. Removing the safety belt in C# needs to be a conscious and (ideally) local decision, so as long as you don't screw up in that specific code area, it should be transparent to all the other code in your project, or other projects using your library.

@svick
Contributor
svick commented Apr 15, 2016

@OtherCrashOverride

You could then, for example, hint that you would like to allocate an object on the stack and then have it automatically 'boxed' (promoted) to a heap value if its reference count is greater than zero when it goes out of scope. This would eliminate "escape analysis" requirements.

That would require you to find and update all existing references to that object. While the GC already does that when compacting, I doubt doing it potentially at every method return would be efficient.

@SunnyWar

Today, the system imposes limitations on us that are purely historical and in no way limit how things can be done in the future.

@OtherCrashOverride

That would require you to find and update all existing references to that object.

A reference should be a handle and therefore abstract whether the storage is stack or heap. The runtime references the handle, not the actual pointer, requiring only the handle itself be updated when promoting a stack pointer to a heap pointer. Again, this is all theoretical.

@xoofx
Member
xoofx commented Apr 15, 2016

@OtherCrashOverride It is misleading to think that switching from a GC to a ref counting scheme will simply solve the problem of memory management for good. The latest research on the matter are even blurring the lines between the two techniques...

Afaik, the most recent research on the subject, RC Immix conservative (2015) which is a continuation of RC Immix (2013) which is a derivation of GC Immix (2008), shows that the best RC collector is able to outperform just slightly the best GC (RC Immix conservative vs GenImmix). Note also that you need a kind of reference cycle collector for a RC Immix to be fully working (detect cycle in object graphs and collect them). You will see in these documents that RC Immix is build upon this 3 key points:

  1. The original Immix memory organization (which is a lot related to locality and friendly with CPU cache lines)
  2. Reference counting is not occuring on each object but on an Immix heap line
  3. A compacting memory scheme

Though I would be personally in favor of switching to a RC Immix scheme, mostly because it delivers a better predictability in "collecting" objects (note even with RC Immix, collection of objects is not immediate in order to achieve better performance)

That being said, again, there is a strong need for other memory management models to fill the gap in terms of performance, because the GC/RC model in itself is not enough alone (alloc class on the stack, alloc of embed instance in a class, borrowed/owned pointers (single owner reference which are destructible once there is no more owner))

@OtherCrashOverride
OtherCrashOverride commented Apr 15, 2016 edited

It is misleading to think that switching from a GC to a ref counting scheme will simply solve the problem of memory management for good.

The goal is not to solve the problem of memory management; rather, to make it deterministic. The focus is on when object cleanup takes places rather than how long it takes. Currently, games and other real-time systems suffer due to the GC halting everything for an indeterminate amount of time while collection takes place. Additionally, running finalizers in different threads causes issues for non-thread-safe GUI APIs and 3D APIs like DirectX and OpenGL. Controlling when an object is collected allows the developer to amortize this cost as needed for the program.

[edit]
Here is an example of a real-world problem deterministic memory management would solve:
http://geekswithblogs.net/akraus1/archive/2016/04/14/174476.aspx

@xoofx
Member
xoofx commented Apr 16, 2016 edited

The goal is not to solve the problem of memory management; rather, to make it deterministic.

Sure, I know what a RC is good for and that's what I implied by 😉

mostly because it delivers a better predictability in "collecting" objects

But the determinism can be reinforced by alternative forms. The following options are perfectly deterministic, and they deliver even better performance in their use cases compare to a GC/RC scenarios because they are lighter and exactly localized to where they are used:

  • Allocation of a class on the stack
  • Embed instance
  • Borrowed pointer (known also as single owner reference)

Note that in order for a RC scheme to be enough efficient, you still need to batch the collection of objects (see what RC Immix is doing). Also, you usually don't expect having to run a finalizer on every object.

If you have followed the development of a language like Rust, they started to have RC objects along Borrowed/Owner reference, but at some point, they almost completely ditch RC and they are able to live without much using it. One idea lies into the fact that most objects used in a program are not expected to be concurrently shared between threads (or even stored in another objects for stack alloc cases...etc.), and this simple observation can lead to some interesting language designs/patterns/optims that RC/GC scenarios are not able to leverage.

Bottom line is: allocation on the GC/RC heap should be the exception, not the rule. That's where deterministic allocation/deallocation really starts to shine in modern/efficient language. But that's a paradigm that is not easy to introduce afterwards without strong breaking changes.

@OtherCrashOverride
OtherCrashOverride commented Apr 16, 2016 edited

Before any more academic papers are cited, we should probably get back to the topic of discussion: "State / Direction of C# as a High-Performance Language"

My suggestions was effectively "Make unsafe code easier to use and more of a first class citizen." I will add that this includes "inline assembly" of .Net bytecode.

Other than that, I do not feel that making C# more burdened with special rules and special cases is the optimal solution (re: destructable types). Instead, modifying the runtime itself could transparently extend performance gains to all .Net languages. As an example of this, a reference counting strategy was cited. It is outside the scope to define that strategy here.

Also mentioned was that it is now possible to actually implement and test these ideas since everything is now open source. That is the point in time for discussion about implementation strategies and citing of academic papers. We can transform theory to reality and have a concrete representation of what works, what doesn't, what can be done better, and what it will break. This would be far more useful than just debating theory in GitHub comments.

Furthermore, I am explicitly against adding SSE4 intrinsics to the C# language. C# should not be biased toward any specific processor instruction set.

[edit]
To clarify: one of the points I am attempting to illustrate is that as we continue to expand across more cores and more threads using those cores, the cost of stopping all of them to do live object tracing becomes increasingly more expensive. This is where the reference counting discussion comes into play. Its a suggestion not a proposal.

@xoofx
Member
xoofx commented Apr 16, 2016

This would be far more useful than just debating theory in GitHub comments.

Precisely, I have contributed to the prototype for class alloc on the stack linked above and I'm a strong supporter to encourage .NET devs performance enthusiasts to build such prototypes.

@svick
Contributor
svick commented Apr 16, 2016

@OtherCrashOverride

A reference should be a handle and therefore abstract whether the storage is stack or heap. The runtime references the handle, not the actual pointer, requiring only the handle itself be updated when promoting a stack pointer to a heap pointer. Again, this is all theoretical.

So, to avoid allocating and deallocating an object on the heap, you instead have to allocate and deallocate a handle on the heap? That doesn't sound like a great win, especially since it also makes every dereference about twice as slow as previously.

@jonathanmarston

@OtherCrashOverride

I am explicitly against adding SSE4 intrinsics to the C# language. C# should not be biased toward any specific processor instruction set.

I agree that intrinsics shouldn't be created for all SSE4 instructions just for the sake of allowing access to SSE4 from C#, but there are some very common bit-level operations for which it would make sense to add highly optimized implementations to the framework anyway. For these, why not make them JIT to a single instruction, if available on the executing platform?

POPCNT especially comes to mind here. Having a BitCount method for numeric types built in to the framework makes sense even without intrinsics (just look how many times MS has reimplemented this algorithm across their own codebases alone). Once it's added at the framework level, why not put the little bit of extra effort in to optimize it away during the JIT when the instructions are available to do so?

@OtherCrashOverride

@svick

So, to avoid allocating and deallocating an object on the heap, you instead have to allocate and deallocate a handle on the heap?

This discussion is not intended to define a specification for memory management. It is hoped that an implementer of such a system would be competent and intelligent enough to avoid obvious and naive design issues.

@jonathanmarston
They key point is that adding something like a BitCount method is a framework and/or runtime modification. It should not require any modification to the C# language to support. I am certainly in favor of adding first class SIMD support to the framework/runtime; however, I do not believe it has a place as part of the C# language itself as SSE4 (or other architecture) intrinsics would do.

@msedi
msedi commented Apr 16, 2016

I also agree that there should be no SSE or other intrinsics in the IL. Moreover there should be a more generic abstraction of this command to let the JITter recognize that there might be some, for example vector operation. Currently writing a simple addition of two arrays produced too much IL code.

But I must admit I'm currently not aware of what the JITter really makes out of this code.

Another things is, that it really might be helpful to go back to some inline IL, something like the asm keyword in C/C++. Event Serge Lidin and some other guys wrote some pre- and postprocessor for C# to feed in IL code. But in fact I don't like this back and forth assembling and disassembling just to get native .NET things in my code.

@OtherCrashOverride

something like the asm keyword in C/C++

I also mentioned that is something I would like. It solves problems such as this: dotnet/corefx#5474

It would also be extremely useful for 'wrappers' that interop with native libraries. Currently, something like SharpDX needs to patch IL bytecode.

@playsomethingsaxman

Born out of a recent discussion with others, I think its time to review the "unsafe" syntax. The discussion can be summarized as "Does 'unsafe' even matter anymore?" .Net is moving "out of the security business" with CoreCLR. In a game development scenario, most of the work involves pointers to blocks of data. It would help if there was less syntactic verbosity in using pointers directly.

On that subject, I absolutely hate that async/iterator methods can't be unsafe. A lot of potential is lost with this blindly religious constraint.

If you want to write code that disregards correctness in favour of performance, you should be writing that code in a language that doesn't enforce correctness (C/C++), not trying to make a correctness-enforcing language less so.

C# is an anti-boilerplate language that lets you see a bigger picture, sooner. That's why we're here. C# is the only serious language that solves the confusion of continuations and that alone makes it more valuable in interop scenarios than using C++ itself. C# is fast becoming the One True Language being used in scripting AND game engines.

Especially since scenarios where performance is preferable to correctness, is an extremely tiny minority of C# use cases.

If you don't pay attention to performance, you don't scale.

@jcdickinson
jcdickinson commented Apr 19, 2016 edited

If you want to write code that disregards correctness in favour of performance

  1. If correctness must suffer, mark it as such with unsafe.
  2. See: Rust. Correctness does not need to suffer for performance.
  3. From #118. The lack of some of these constructs actually limit the correctness of safe programs:

Interestingly, that support in the CLR is actually a more general mechanism for passing around safe references to heap memory and stack locations; that could be used to implement support for ref return values and ref locals, but C# historically has not provided any mechanism for doing this in safe code. Instead, developers that want to pass around structured blocks of memory are often forced to do so with pointers to pinned memory, which is both unsafe and often inefficient.

@DemiMarie

My thoughts:

  • The CLR support for safe references needs to be exposed in full.
  • SIMD instructions should be a library/runtime feature, not a language feature. They should be JIT intrinsics rather than IL opcodes.
  • With regards to deterministic destruction: I think a major gain would be a real-time garbage collector. I know of a few techniques:
    • IBM's Metronome
    • Azul Systems's C4 (unfortunately patented and requires kernel support)
    • Red Hat's Shenendoah

Unfortunately, I don't know if all of them could handle the requirements of .NET. Failing that, I think that a fully concurrent, non-compacting mark/sweep collector might help. I believe that [the new garbage collector proposed for LuaJIT][1] represents the state of the art – although it does not consider parallelism or concurrency it seems to me to be a type that could be adapted to this situation.

With regard to multicore scaling: I do not believe that a single shared GC heap can scale unless the GC is aware of it AND memory allocation rates can be controlled (to avoid becoming memory bandwidth bound).

Sadly, I don't think C# can ever become as fast as Rust, except in areas such as memory allocation. Rust was designed to be both fast and correct, and had to trade off some usability to do it, such as with explicit lifetime tracking.

@SunnyWar

Read this: Introducing a new, advanced Visual C++ code optimizer

I did a simple test to see if .Net did even one of the most basic of substitution optimizations %2 -> &1. nope.

So we can talk about all kinds of things to make the algorithms, data structures, and language faster but it seem to me that a more fundamental problem needs to be addressed first.

The C++ teams and the .Net teams need to share knowledge. Roslyn and the CLI should look at every optimization that C++ compiler and linker makes and figure out how, if possible, they can get them into .Net. Until that happens, .Net will always be the slow step child to C++.

@paul1956

This is just my guess but the C++ compiler has always been pushed/helped by Intel optimizing C++ compiler (I know the reverse is true) no such competition exists for .Net. Intel has (had?) whole teams looking to get every last cycle out of their CPU's in order to beat competing architectures and it was cheaper/faster to make compiler better that the CPU.

@GSPP
GSPP commented May 12, 2016

@SunnyWar that is true. Another one is that a.x + a.x is translated to two loads instead of one. Some of the most basic optimizations are not implemented.

RyuJIT brought loop cloning which is a big win. This needs to be acknowledged as well. It's not all bad.

Given that C compilers differ only in a few percent (+-20% on real code?) of performance I doubt that the Intel C compiler was a tool suitable to beat the competition.

@benaadams

C++ compilation also takes a very very long time

@playsomethingsaxman

NGEN is supposed to do all the work on the dev's computer so it doesn't have to use up CPU on the user's computer. Yet NGEN doesn't optimize much of anything. It's a tragedy.

.NET original development seems to be lax in general. They won't say this but I suspect the whole move to "universalize" the core and replace the Win10 version with WinRT underneath is based on a grossly non-scaling Framework implementation, like I/O that uses event signalling instead of IOCP packets.

@HaloFour

@playsomethingsaxman

The "ngen" tool has nothing to do with the developer's computer. It's supposed to generate cached optimized images on the user's computer during an installation process.

If "ngen" or the JIT is not performing basic optimizations then that should be considered either a bug or a feature request of the CLR, not of Roslyn. It's not Roslyn's place to perform aggressive optimizations. Doing so would be counterproductive as it would make it more difficult for the JIT to do its job. IL expresses intent, not implementation.

@SunnyWar
SunnyWar commented May 12, 2016 edited

@HaloFour I disagree...kinda. If Roslyn can produce a result that is easily highly optimizable, it should do so. The .NET team has said over and over and over that they care more about how fast the JIT loads the assembly than they do about how fast the executable runs. Until that changes, the JIT is NOT going to do the kind of optimization we all want.

So...that leaves us with a big hole. JIT isn't going to do it. Roslyn thinks it's not it's problem...so nothing happens.

And the consumer get's sub-optimal programs, the server runs sub-optimal code. The cost for server farms running .Net apps goes up. The number of ASP.Net servers you need goes up. The number of Exchange servers is higher than necessary. It's a lose, lose situation.

@HaloFour

@SunnyWar

As mentioned in #11268 (comment) high-ranking members of the Roslyn team were on the Java team when byte code optimizations were removed from javac because it was demonstratively counter-productive.

@GSPP
GSPP commented May 12, 2016

Yes, C# is not the place to perform optimizations except if they are only feasible at the language level.

But the C# compiler already breaks that rule by optimizing nullables in a variety of ways. This is painting over the JITs inabilities. I don't question that choice, I think that's sensible.

I don't think this discussion thread can avoid JIT quality entirely because JIT quality influences C# design decisions. (Example: Better escape analysis in the JIT could make C# more lenient towards introducing allocations.)

@playsomethingsaxman

Yes, C# is not the place to perform optimizations except if they are only feasible at the language level.

I cannot believe I'm even hearing this. You know the saying: garbage in, garbage out. You can optimize a lot of garbage into really high-quality garbage, or optimize non-garbage. I prefer the latter.

@DemiMarie

Here is my thought:

  • The CLR really needs a tiered JIT if it is to compete with C++ for performance. There is no way that a method-based JIT can be fast enough both in terms of compilation time and performance of generated code. It also needs speculative optimization and deoptimization: being able to optimize based on what is likely, and to throw away the code if the assumption fails.
  • An aggressive AOT compiler – most likely LLVM-based, but with a CIL-specific optimizer before LLVM – is also important. The biggest gains could be achieved by a whole-program compiler that used knowledge about the entire program and all libraries to perform such optimizations as devirtualization. Finally, IMO there is no way around AOT compilation for non-server apps that need to start within user's reaction time.
  • LLVM (in LLILC) or the C++ compiler (in IL2CPP) will not be able to perform CIL-specific optimizations. Apple's Swift compiler has its own intermediate language and optimization passes that run on it, and Mozilla's rustc will soon be getting its own optimizations. In the context of .NET Core, these optimizations are most likely language-independent but CIL-specific, so a front-end optimizer after the CIL reader is needed. These optimizations include:
    • Inlining
    • Eliminating array bounds checks.
    • Alias analysis
@damageboy

@playsomethingsaxman:

On that subject, I absolutely hate that async/iterator methods can't be unsafe. A lot of potential is lost with this blindly religious constraint.

I totally agree with you :)
I recently tried to take it up on the roslyn gitter and got into a semi-serious discussion with @agocke about this on April 14th, if you follow it through: https://gitter.im/dotnet/roslyn?at=570fcf5b548df1be102ceded, it's pretty much me, @jmarolf and @agocke for a good couple of pages.

It seems like we gravitate towards the end to a conclusion that it might be ok, although @agocke is really not sure about the voodoo effects of allowing such code...
@agocke mentioned that the main hurdle right now is that fact that the async/yield state-machine generator (for lack of better naming) is incapable of doing stack-spilling for unsafe value types into the state-machine...

@damageboy

@drbo Can you elaborate, perhaps with concrete examples what it is you exactly mean by "CIL-specific" optimizations?

@benaadams

Yes, C# is not the place to perform optimizations except if they are only feasible at the language level.

I cannot believe I'm even hearing this. You know the saying: garbage in, garbage out. You can optimize a lot of garbage into really high-quality garbage, or optimize non-garbage. I prefer the latter.

@playsomethingsaxman if the optimizations can be done at the cil/il level then it improves all languages not just C# is what I think he meant?

@SunnyWar
SunnyWar commented May 13, 2016 edited

Concerning cil/il level optimizations... here is an example:
return x % 2 != 0 ? 4 : 2;

IL_0000: ldarg.0
IL_0001: ldc.i4.2
IL_0002: rem
IL_0003: brfalse.s IL_0007
IL_0005: ldc.i4.4
IL_0006: ret
IL_0007: ldc.i4.2
IL_0008: ret

Compared to a function that produces identical results:
return (x & 1) != 0 ? 4 : 2;

IL_0000: ldarg.0
IL_0001: ldc.i4.1
IL_0002: and
IL_0003: brtrue.s IL_0007
IL_0005: ldc.i4.2
IL_0006: ret
IL_0007: ldc.i4.4
IL_0008: ret

The IL for the second with produce faster machine code on all known environments than the first. So this level of optimization (pattern matching, as exists in all optimized C++ compilers) should be done not by the backend compiler but by the front end compiler. Essentially I'm claiming that the IL in both should have been the same and should have been the second one.

@playsomethingsaxman

So this level of optimization (pattern matching, as exists in all optimized C++ compilers) should be done not by the backend compiler but by the front end compiler.

I will confess that the C# compiler is bewilderingly fast and I am in total admiration of the team like you have no idea. For rapid debugging, that's crucial. For release builds, it's almost the exact opposite: let the build machine bear the cost so the end-users don't have to.

I totally agree with you :) I recently tried to take it up on the roslyn gitter and got into a semi-serious discussion with @agocke about this on April 14th, if you follow it through

You are my hero. All I want to do is dereference a pointer, I don't need "fixed" or managed reference management or anything clever. They will argue, "you should be using C++", to which I would respond, "I can outscale C++ on C#6 Win32 interop because async/await inspires more powerful custom algorithms that would make a C++ dev shriek".

@SunnyWar

I honestly don't understand the "you should be using C++" mentality. I know it existing out in the real world but it appears to have infected the thinking of the .Net teams. I think they should be striving to replace C++ with C# and other CIL languages in nearly all areas of performance. It's appears to be doable. We just need to do it.

Some might say, "no, the garbage collection is always a tax that will slow things down". Yes and no. There are lot's of places where the GC does a lot of work today that is unnecessary. Eliminate those and the excuses start to drop off. For example: dotnet/coreclr#4365 and dotnet/roslyn#161

Then there are the algorithms (which is arguable the bulk of all code) that don't do any garbage collection. In those cases I see very few barriers to very high performance. I'm often surprised at how much fast C++ can do basic math. Why? There's no excuse other than they just haven't done it. Simple as that.

As a matter of fact, C# does a better job than C++ in capturing the "intent" of the developer. This means the compiler should, in theory, be able to produce even more optimized machine code than C++. Add to it the knowledge the JIT has knowledge of the specific hardware and C++ should be the slower option.

BTW: All of that was promised to us back when C# was first introduced. It's time the promise becomes a reality.

@agocke
Contributor
agocke commented May 13, 2016

@damageboy @playsomethingsaxman You now have two people asking for it, so I take this request more seriously now :)

If you want to create an issue, that works for me.

@damageboy
damageboy commented May 13, 2016 edited

@agocke we are few and have a few loose nuts, but pleasant otherwise :)

As long as we're on the subject of CIL / C# Optimization, has there been any discussion of converting Linq constructs into fast, allocation-less imperative code, such as what is being done here:
https://github.com/nessos/LinqOptimizer

This looks like a grand optimization/feature for roslyn: Linq!, now with actual performance :)

@agocke
Contributor
agocke commented May 13, 2016

@damageboy We want to fix this at a lower level, by optimizing delegates and possibly providing a new, lower allocation API instead of IEnumerable (IFastEnumerable?).

@jaredpar Has a bunch of crazy ideas about this.

@playsomethingsaxman

We want to fix this at a lower level, by optimizing delegates and possibly providing a new, lower allocation API instead of IEnumerable (IFastEnumerable?).

@agocke Don't forget, we still BADLY need IAsyncEnumerable! I can't tell you how bad!

@SunnyWar

@damageboy I've been hoping for a LinqOptimizer for years! Here you have a functional and very expressive representation of intent and it's slower a the procedural alternative. What the __??

I don't recall the last dev work I worked with that didn't ban Linq for performance reasons...and it's should be opposite. So sad.

@mattwarren

I don't recall the last dev work I worked with that didn't ban Linq for performance reasons...and it's should be opposite. So sad.

Including Roslyn itself!

But to-be-fair, only in hot-paths, see the coding guidelines

@DemiMarie
DemiMarie commented May 13, 2016 edited

@damageboy By "CIL-specific optimizations" I mean optimizations that are relevant for languages that compile to CIL, but not to C or C++.

LLVM is a fantastic compiler optimizer and backend. However, it only knows about constructs found in C-like languages and lacks optimizations that aren't relevant to C. These include:

  • escape analysis
  • closure-related optimizations
  • optimizing LINQ
  • allocation removal
  • possibly devirtualization

Furthermore, LLVM is slow. Not by standards of C compilers, but it is slow for a JIT. The LLVM developers are working on improving this speed, but for now there are no optimizers that are fast enough AND generate sufficiently fast code, and this may not change. That is why tiered JIT compilation and/or AOT compilation is crucial: applying heavy optimizations to all code at runtime simply takes too long. Only an AOT compiler or a higher-tier JIT can afford to take the amount of time needed.

@DemiMarie

@SunnyWar One problem is that unlike Haskell, C# is not purely functional. That means that most of the rewrites you would like to make are not necessarily valid. GHC's solution is to apply rewrite rules that automatically rewrite the code. These rules are valid for Haskell, but they are not necessarily valid for C# as they may change evaluation order.

@xen2
xen2 commented May 14, 2016 edited

Concerning IEnumerable<T>, one problem is lack of devirtualization by language design. Basically you have to pay the price of interface call + GC allocation in every generic method that takes a IEnumerable<T> and do a foreach loop on it.

Ideally, it would be nice if, in some specific case, we could take a more "template" approach to the problem, where I could write a method like LINQ Sum() based on traits/contracts (there is work in progress on that, but not sure it would allow such cases):

Sum<T, U> (T enumerable) where  T has Trait { U GetEnumerator(); }
{
    foreach (var i in enumerable)
    {
    }
}

Idea is to force this kind of code to be JIT and optimized for each separate type using its overload of GetEnumerator() returning a struct (most .NET collections provide them).

This kind of technique are used widely in C++ std library (and quite powerful when combined with allocation-less functors for std::sort(), etc...)

Right now, this is impossible in C# short of rewriting one method such as Enumerable.Sum() per type you want to optimize (basically doing template instantiation job manually). As a result, you pay the "virtualization" price of IEnumerable (interface calls + GC allocation + boxing) almost all the time, even when the compiler did have more semantic information when compiling that specific generic instantiation. This prevent lot of straightforward inlining/optimizations as well.

I agree this shouldn't be done in all cases (as this dramatically increase code size), but in various scenario (esp. small helper methods such as LINQ, Sort(), etc...), it could go a long way to avoid unecessary GC pressure and also improve performance on tight loops.

@benaadams
benaadams commented May 14, 2016 edited

@xen2 bit horrible, but something like?

interface IEnumerable<out T, TEnumerator> where TEnumerator : struct, IEnumerator<T>
{
    TEnumerator GetEnumerator();
}

public T Sum<T, TEnumerable, TEnumerator>(TEnumerable e)
    where T : ...
    where TEnumerable : IEnumerable<T, TEnumerator>
    where TEnumerator : struct, IEnumerator<T>
{
    T sum;
    foreach(var value in e)
    {
        sum += value;
    }

    return sum;
}

Assuming there was something for + operator

@playsomethingsaxman

Ideas for speed (not sure which is language and which is CLR):

  • 128-bit integers (Int128/UInt128) with matching literals. Then I could define COM GUIDs in large tables that are much faster than Guid.Parse() calls. Wouldn't be bad for IPv6 addresses either. I bet we would easily find new uses even.
  • C++ constexpr-like methods, run by the compiler and translated to a reduced form (preferably a constant). Then a version of Guid.Parse() could generate a 128-bit integer instead of 36 Unicode characters (36 * 16 = 576-bit) and a parse routine, while at runtime a Guid constructor accepts a 128-bit integer. Another example is a lot of the interop stuff like MAKE_HRESULT; if you could use constexpr methods in enum definitions then maintainability GOES THROUGH THE ROOF without all these middling reflection operations dragging the CPU down. There's way more than this.
  • Allow an "unsafe struct" generic constraint so I can get the address of an enum and not have to use Marshal.StructureToPtr() or its counterpart.
@benaadams

@playsomethingsaxman if you are dealing with already fixed memory then you might want to check out the System.Runtime.CompilerServices.Unsafe library

https://github.com/dotnet/corefx/blob/master/src/System.Runtime.CompilerServices.Unsafe/src/System.Runtime.CompilerServices.Unsafe.xml

@playsomethingsaxman

If you are dealing with already fixed memory then you might want to check out the System.Runtime.CompilerServices.Unsafe library

That code gets me thinking... remember those __asm scopes in C++? Why can't we have the same for IL in C#? Or heck, why not just make it a method specifier and be able to plop IL methods right between C# methods? I used to hate having to run MASM separately, used to hate disturbing that locality.

@benaadams

but __asm scopes would be processor specific so would require more ifdefs and validation?
/cc @migueldeicaza

@playsomethingsaxman

but __asm scopes would be processor specific so would require more ifdefs and validation?

I meant IL, not asm. This crazy business:

    .maxstack 2
    ldarg.0
    ldarg.1
    ldobj !!T
    stobj !!T
    ret
@adamchester
adamchester commented May 15, 2016 edited

@playsomethingsaxman F# does this in some special places

Edit: fixed link

@aL3891
aL3891 commented May 15, 2016

@playsomethingsaxman This would enable some tasty optimizations, if not only in mscorlib that cant depend on things like Unsafe (I'm sure there's ways around that but this would make it easier)

@svick
Contributor
svick commented May 15, 2016

@playsomethingsaxman

  • You seem to be focusing on the speed of Guid.Parse() quite a lot, is that really a bottleneck for you? And isn't the Guid(int, short, short, byte, byte, byte, byte, byte, byte, byte, byte) constructor suitable for you (especially since that is how Guid is internally represented)?
  • Regarding inline IL, what could you achieve using that that you can't already do using C# or S.R.CS.Unsafe?
@playsomethingsaxman
playsomethingsaxman commented May 16, 2016 edited

You seem to be focusing on the speed of Guid.Parse() quite a lot

Guid.Parse() is just an example everyone understands. But if an actual problem is the price of admittance, I have long lists of them in static classes that run linearly in the static constructors.

Regarding inline IL, what could you achieve using that that you can't already do using C# or S.R.CS.Unsafe?

That question is way beyond my pay grade.

Here's another strong idea: give IntPtr/UIntPtr full operator support like int/uint, maybe even intptr/uintptr keywords (YES PLEASE). I can't tell you all the places we have to use proxy variables, casting and copying, casting and copying. It got so freaking out of control that I had to create my own IntPtrEx/UIntPtrEx structs (sadly they don't perform in tight loops yet). Right now most of us cheat and use long for managed-ish pointer arithmetic because long fits both 64-bit and 32-bit. But will we remember to update that code when the the 128-bit omg word size comes out? This is one of those situations where excessive dogma will actually force people into producing less provable code.

@andre-ss6

I tend to agree with others that what we really need right now is a [better] AOT compiler and/or a better JIT.

I have the impression that C# is already fast enough when dealing with "simple" code (structs, no virtual calls, only old-school imperative code (no delegates, no linq)).
What is the point of having a lot of new performance features but having to write C++-style code? (OK, maybe an exaggeration here, but hey) Then why just not go back to C++? I know, I know, we want to use C#. I'm not one of these "go to C++" folks; my point is: high performance code in C# should still feel like C#.

I think we first need better compilers. Then new language features.

@playsomethingsaxman

The request to have at least some hinting to the GC, even if not direct control, is at the top of our iist. As the programmer, we are in a position to declaratively "help" the GC but have no opportunity to do so.

Not only do we not have the opportunity, we are actively told not to, in the hope that a few bright people in our camp know when to ignore the advice.

What is the point of having a lot of new performance features but having to write C++-style code?

You are probably already writing C++-style code if you're any bit worried about performance beyond Big O. You won't even have a choice if any of the .NET Framework's classes under-perform, like ObservableCollection<> which doesn't offer range operations and causes many a XAML list to populate slowly.

@DemiMarie
DemiMarie commented May 19, 2016 edited

That is why I am so interested in LLILC AND in a tiered JIT.

LLILC can perform as many optimizations as Clang can, and Rust has shown that array bounds checks do not create a huge problem. But LLVM is slow – too slow for a JIT that needs to compile all code while the user is waiting.

An AOT compiler or a tiered JIT compiler with deoptimization is the solution. That way LLVM only sees a much smaller amount of code while the user is waiting – or can see all of it, offline.

@ilexp
ilexp commented May 19, 2016

I think we first need better compilers. Then new language features.

Relying on the compiler to guess what we're doing and how to optimize it has limits, because it can only assume so much. Having an explicit way of performing certain optimizations as a programmer is more reliable across runtimes and configurations and can have a greater effect, because we as programmers know a lot more context.

I agree that better compilers would be nice. I disagree that "we need them first". Having the right tools for performing certain optimizations explicitly is just as important.

@OtherCrashOverride

It doesn't really matter how optimized or awesome a JIT compiler is when the garbage collector has stopped all threads of executions to do its work. This is the issue game developers face: everything freezes when a GC happens regardless of how optimized the code is. Its possible to reduce garbage, but not eliminate it entirely in real world code.

@GSPP
GSPP commented May 21, 2016

@OtherCrashOverride shouldn't the rather new background GC modes fix that issue? Assuming pause times for G0 and G1 are acceptable at any time.

@OtherCrashOverride

shouldn't the rather new background GC modes fix that issue?

I doubt it will have any real world affect. Graphic APIs typically issue all drawing on a single thread. When that thread is frozen, so is rendering, resulting in a defect perceptible to the end user. Someone will need to test and measure.

@GSPP
GSPP commented May 21, 2016 edited

@OtherCrashOverride maybe you should research what that mode does. It promises to only ever pause threads briefly except under rare circumstances.

You seem to be a game developer and I have wondered for a long time whether that mode solves the problem. I'd like your opinion on it.

@OtherCrashOverride

It promises to only ever pause threads briefly except under rare circumstances.

This goes back to the deterministic discussion. If you can't predict when and how long, you can't avoid rendering stuttering. You can make everything massively parallel but you still have to "join" when its time to render the current state. This means you will only ever be as fast as the GC despite any compiler and/or code optimizations.

At a minimum, the GC would need to be schedulable and interruptible. The program would need to be able to say "the runtime can GC at this point, but never before or after" and "the GC has this many nanoseconds of time to accomplish whatever it can". This then leads to a possible slow build up of garbage if there is not enough allocated time to clean it all up. This is the reason a reference counting strategy was mentioned earlier. Its completely deterministic and avoids these issues.

Right now there appears to be a "civil war" of sorts in the .Net world. CoreCLR's "customer" is ASP.Net. This puts the feature focus on things that make sense in that world. An example is the JSON/XML/XAML project file battle. After CoreCLR ships and the "dust settles", then would be the time to revisit issues such as these. Currently, not even the tooling is available to make CoreCLR useful in any scenario outside ASP.Net.

@OtherCrashOverride

Btw...

As eluded to earlier, there is still an issue with the GC performing its duties on a different thread of execution. GUI APIs and graphics API's such as OpenGL are not thread safe: your operations must be performed on specific threads. A concrete example is creating a GPU resource in OpenGL and destroying it as part of a dtor. Currently, there is no way to do this in .Net. You have to delegate the job from the dtor (finalizer) back to the appropriate thread. The means GUI and graphics API's typically end up using IDisposable which requires special code support to check if the object is disposed at every possible execution point. It also requires that code manage the lifecycle and call Dispose() only when appropriate effectively making the GC system pointless.

@playsomethingsaxman
playsomethingsaxman commented May 21, 2016 edited

It also requires that code manage the lifecycle and call Dispose() only when appropriate effectively making the GC system pointless.

I think serious code is ALWAYS going to be deterministic, especially when freeing resources that can only be freed on the creating thread. The GC's divinity is with handling all the tedious collection types in between, and making sure that SafeHandles are only leaked to the finalizer. When writing frameworks I allocate from the unmanaged heap as much as possible while letting the higher glue code "play .NET".

@DemiMarie

The only solution for games is an almost-fully concurrent collector with strictly bounded pause times. Such a collector may require a read barrier and have reduced throughput. Azul's Zing JVM does this, but the algorithm it uses is patented. The Shenandoah collector proposed for OpenJDK may not be able to support interior pointers. The only other option that I know of is a non-moving collector.

@benaadams

Are you using the Server GC which mostly runs in parallel? Also are you setting GCLatencyMode.SustainedLowLatency mode during the times when you can't allow blocking collections?

https://blogs.msdn.microsoft.com/dotnet/2012/07/20/the-net-framework-4-5-includes-new-garbage-collector-enhancements-for-client-and-server-apps/

@SunnyWar
SunnyWar commented May 23, 2016 edited

I know I can sound very critical some times, so I decided to take a look at what performance improvement have been mad, or are being made, in the CLR. When I choose the "label:optimization is:closed label:"4 - Done" it comes up with NO results. So then I looked at, "label:performance is:closed label:"4 - Done"...no results. So then I looked at, "is:open label:performance label:"2 - In Progress" and I see only one (thank you @stephentoub). And "is:open label:optimization label:"2 - In Progress" give no results.

So I hear a lot of talk and a lot of great ideas, and I see a many things put in the "future" bin for CLR but at the moment very, very little appears to be actually happening.

@stephentoub
Member

very, very little appears to be actually happening

Lots of PRs to both coreclr and corefx have been merged to improve performance. Many just haven't been labeled as such.

@OtherCrashOverride

Are you using the Server GC which mostly runs in parallel? Also are you setting GCLatencyMode.SustainedLowLatency mode during the times when you can't allow blocking collections?

That question should probably be directed at the Unity3D guys. Since their targets include non x86 devices like ARM based mobile phones, I doubt its a feature they could take advantage of. In such constrained environments they do no have the luxury of throwing more cores or more RAM at the problem.

The GC discussion should probably be spun off to a different topic. I think everyone is in agreement that GC does affect performance; however, there is little C# as a language can do about it. In fact, "inline IL" is about the only language feature being asked for. The rest are runtime or framework improvements.

@benaadams

In fact, "inline IL" is about the only language feature being asked for.

Is this covered by? #11475

@agocke
Contributor
agocke commented May 23, 2016

@OtherCrashOverride Actually, the purpose of the ref-locals/ref-returns feature is to partially address these GC considerations. It makes it much easier to work with structs without allocation in critical paths.

@OtherCrashOverride

Is this covered by? #11475

I was hoping to see inline IL. I don't think adding a pseudo-language is the answer (it should be enough to learn IL, not additionally an IL for IL). I'm really tired of flame wars, so I will leave that topic alone.

Actually, the purpose of the ref-locals/ref-returns feature is to partially address these GC considerations.

Something like that can already be done today by using a single element array of a struct as a parameter (and keeping it around). Its syntactically much cleaner than that proposal (IMHO). Its often a solution when you need to use the struct for an interop (GCHandle.Alloc) return result. Again, its not worth the flame war, so I will also leave that topic alone too.

Rather than debates and drama, its probably better to just fork everything after Release and try out alternatives. Then everyone can make more informed feature choices.

@jcdickinson
jcdickinson commented Jun 2, 2016 edited

Something like that can already be done today by using a single element array of a struct as a parameter (and keeping it around). Its syntactically much cleaner than that proposal (IMHO).

This can also be achieved with ref parameters, which it is in many game frameworks. In terms of code clarity consider:

// Without ref-returns and ref parameters for operators, this copies (fermi estimate, probably less):
// 4 rows * 4 columns = 16 cells 
// 16 cells * 4 bytes for float = 64 bytes per stack frame
// 64 bytes per stack frame * (1 copy for call + 1 copy for return) = 128 bytes per call
// 128 bytes per call * 2 calls = 256 bytes
// However, the code is much clearer.
var wvp = world * view * projection;

vs.

// This is zero-allocation, workaround for the lack of ref-returns and ref parameters for operators.
// However, the code makes developers cry blood.
var wvp = new Matrix();
Matrix.Multiply(ref world, ref view, out wvp);
Matrix.Multiply(ref wvp, ref projection, out wvp);

Arguably DOP would alleviate this (as it does use arrays); however, something so simple shouldn't have such poor readability and should not deceptively copy so much memory.

@martinvahi

If You are planning future-proof high-performance features for C#, then please do not leave out the research done by Tucker Taft on his ParaSail programming language.

@AlgorithmsAreCool

@martinvahi Are there any specific parts of interest in ParaSail that would be applicable to C#?

@zpodlovics zpodlovics referenced this issue in Microsoft/visualfsharp Jul 20, 2016
Open

How to drive F# Adoption - Part 4 #1339

@martinvahi

@AlgorithmsAreCool I do not know the answer to Your question yet, but I do know that I'm not capable of keeping track of tens of threads manually, which means that the "data parallelism", call it map-reduce, if You will, is currently the only way for a person like me to efficiently utilize multi-core CPU-s. The way I understand it in 2016_07, the general idea behind algorithm parallelization is that loop iterations that do not depend on each other, can be executed in any order, including on separate CPU-s or even on different machines. From that point onward it's a matter of creative thinking, how to re-arrange a program to as-parallel-as-possible components while keeping in mind that no program can be run in less time than it takes for its un-parallelizable sequential parts to execute (the Amdahl's law.).

May be the Abstract Syntax Tree can be somehow analyzed and re-arranged to utilize N threads as efficiently as possible. I do not know the answer yet, but the ParaSail is certainly one thing to study, when planning for new performance features.

@svick
Contributor
svick commented Jul 20, 2016

@martinvahi

From the page about ParaSail you linked to:

given a ParaSail expression like F(X) + G(Y), the language rules ensure that it is safe to evaluate F(X) and G(Y) in parallel

I don't see how something like that could work on the CLR. .Net does have global state, parameter aliasing, pointers (both actual pointers and references, which I think are effectively the same thing in this case), etc.

This might not make automatic parallelization impossible, but it does make it much harder.

@martinvahi
martinvahi commented Jul 21, 2016 edited

@svick Currently I do not know the ParaSail well enough to answer to Your comment from ParaSail point of view, but simply by looking at Your comment I can propose that the

sum = F(X) + G(Y)

can be rewritten as

array_of_functions = [F, G]
array_of_args = [X, Y]
throw if  array_of_args.length != array_of_functions.length
i_len = array_of_args.length
x_whatever_useless_value=42
array_of_function_values=new Array(i_len, x_whatever_useless_value)
loop(i=0 ... (i_len-1) ){
    // All iterations of this loop are independent of each other.
    // Therefore the iterations can be executed in parallel, on different CPU-s.

    func = array_of_functions[i]
    arg = array_of_args[i]
    array_of_function_values[i] = func(arg) // by avoiding push() there is no need to lock the array
} 
sum=0
array_of_function_values.each { |x_value|   sum += x_value } // is sequential activity

That is to say, may be the first step might be a translator from C# to C#. I suspect that just like GCC has different compiler options, optimization options, there will also be a set of parameters that determine, how to translate one C# code to another. (I understand that my example here was not C#, but in pseudocode).

@HaloFour

@martinvahi

Not without proving that F, G, X and Y share absolutely no state, which is something that neither the C# compiler nor the CLR can guarantee.

@benaadams
benaadams commented Jul 21, 2016 edited

@martinvahi You can manually do it with Parallel.For https://msdn.microsoft.com/en-us/library/dd460703(v=vs.110).aspx

e.g.

int[] nums = Enumerable.Range(0, 1000000).ToArray();
long total = 0;

// Use type parameter to make subtotal a long, not an int
Parallel.For<long>(0, nums.Length, () => 0, (j, loop, subtotal) =>
{
    subtotal += nums[j];
    return subtotal;
},
    (x) => Interlocked.Add(ref total, x)
);

Console.WriteLine("The total is {0:N0}", total);
@martinvahi

@HaloFour Having little is better than having nothing. (I'm not complaining here, I'm just paraphrasing.) May be the parallelizaton can be limited to some simpler cases then, something, where compile-time static analysis can determine that the functions do not share any state. That can be computationally terribly expensive and slow, but the use of that feature can be optional.

@HaloFour

@martinvahi

Even if the compiler could prove that it couldn't necessarily prove what those functions do, how long they're supposed to take, etc. Short of being insanely simple I doubt that the compiler could do a good job of performing that static analysis anyway. And if they were insanely simple then they very likely wouldn't benefit from parallelization. And even if the feature were provided the developer would need to have ways of controlling it, requiring new language constructs to hint to the compiler how things would need to be parallelized anyway. Getting it even slightly wrong would inevitably result in both broken state and considerably slower execution.

With TPL, DataFlow, Rx and async/await it's quite simple to manage threads in C#, even if it is somewhat manual.

@playsomethingsaxman

Everyone is aware that Parallel.For() blocks the calling thread and can only scale to a certain point, yes?

@stephentoub
Member

Parallel.For() blocks the calling thread

It uses the current thread in the processing of the loop, just as does a regular for loop. If you don't want it to use the current thread, you can wrap it in a Task.Run.

@playsomethingsaxman

It uses the current thread in the processing of the loop, just as does a regular for loop. If you don't want it to use the current thread, you can wrap it in a Task.Run.

That shoves the problem under the carpet if I need to run many enumerations at once in server code. You shouldn't need a thread to run a for-loop, it should be advanced cooperatively by completed iteration tasks. I had to write my own ForAsync and ForEachAsync.

@stephentoub
Member

Parallel.For is for compute-bound work.

@svick
Contributor
svick commented Jul 21, 2016

@playsomethingsaxman I think parallel (or is it concurrent?) asynchronous foreach would be a great addition to corefx. The issue about is here: dotnet/corefx#699.

@jcdickinson
jcdickinson commented Jul 21, 2016 edited

@HaloFour: I doubt that the compiler could do a good job of performing that static analysis anyway.

Agreed. CodeContracts makes a pretty good attempt at it, but build times really suffer (to the degree where by default it does the post-build analysis in the background). I think the ParaSail story is something that your runtime needs to be designed for in the first place, it's not something that is reasonably possible to iterate into.

@DemiMarie

I think such analysis might be possible by a sufficiently smart compiler.

Such compilers do exist. Good examples are MLton (SML), Stalin (Scheme),
and possibly the Felix compiler. These all use a closed-world assumption:
the source code of the entire program must be available, and loading of
code at runtime is not allowed. This allows for heavyweight control- and
data-flow analysis to figure out what function is called at each location
in the program and even the range of values. I suspect that such analysis
could also be used for parallelization.

On Jul 21, 2016 5:00 PM, "Jonathan Dickinson" notifications@github.com
wrote:

@HaloFour https://github.com/HaloFour you are most definitely correct:

I doubt that the compiler could do a good job of performing that static
analysis anyway.

CodeContracts makes a pretty good attempt at it, but build times really
suffer (to the degree where by default it does the post-build analysis in
the background). I think the ParaSail story is something that your runtime
needs to be designed for in the first place, it's not something that is
reasonably possible to iterate into.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#10378 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AGGWB1a9Lyg17Z6Jy9thr07_zyp6nfJLks5qX93_gaJpZM4IBbvF
.

@agocke
Contributor
agocke commented Jul 21, 2016

sufficiently smart compiler

This is a myth for a reason. ;)

@DemiMarie

Depends on the definition: if you mean a compiler that can make your
high-level code outperform C, then Felix and Stalin quality.

On Jul 21, 2016 7:01 PM, "Andy Gocke" notifications@github.com wrote:

sufficiently smart compiler

This is a myth for a reason. ;)


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#10378 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AGGWB2ZPVlvHrOdoSGGT860i7yE_0kp4ks5qX_o-gaJpZM4IBbvF
.

@sttaft
sttaft commented Jul 22, 2016

@jcdickinson ... I think the ParaSail story is something that your runtime needs to be designed for in the first place, it's not something that is reasonably possible to iterate into.

Either that, or you need to add some language feature similar to the "Global" annotations of a language like SPARK, which are enforced by the compiler, and which indicate the global data that can be read or updated by a given function. You probably also want to keep track of whether a function is potentially blocking (i.e. could end up waiting on the equivalent of a semaphore), as that affects the safe scheduling of work items (aka pico-threads) onto worker threads, using a mechanism such as work stealing.

@martinvahi

The ParaSail was an example of a cutting edge project, but I think that in addition to the cutting edge things less cutting-edge projects can be a source of inspiration. For example, Babel High-Performance Language Interoperability, Chapel parallel programming language. The Chapel has a nice technical background description at a PDF file.

As of 2016_08_01 do not know any of those projects, but one of the reasons, why I'm always on a lookout for different technologies is that I always want to use the most skillfully crafted components at my software and according to my current (2016_08_01) view the most skillfully crafted components are the components that have been created by people, who have taken their time for working on the component while having really good domain knowledge and are either stubborn enough to resist social pressure or have a lack of social pressure to publish hastily created work. Consequently I will have to use multiple programming languages simultaneously at a single application and the operating system is the universal layer that allows data exchange between the components that have been written in multiple different programming languages. (I know that the JavaVM, CLR, etc. have multiple programming language "front ends", but those do not include all possible programming languages and programming language implementation specific quirks.) It seems that the scientific super-computer community has had similar problems literally since before I was born (1981).

@DemiMarie

I think that Microsoft's Midori project should also be a good source of ideas.

@playsomethingsaxman

I think that Microsoft's Midori project should also be a good source of ideas.

I think that Midori should be completely resurrected. I would never ever let a fertile garden of ideas like that die out.

@martinvahi

If by "Microsoft Midori" You mean
an experimental operating system,
then it seems to me that in 2016 the main issue
with operating systems is not the multiple CPU-support, but
fundamentally flawed packaging systems.
I just copy-paste one of my posts from the ParaSail forum:

---citation--start---

I just wanted to add that from applications software developer's
perspective the same announced version of any software component
might be compiled/built with different compile/build options. 
Any combination of the build option values is actually a
separate version of the software component. The operating
system libraries and package systems are

FUNDAMENTALLY FLAWED

by reflecting only the announced versions, by omitting
the build option based sub-versions. The set of build options
include the options that are set automatically at the
classical "configure" step.

The situation is even worsened by the fact that even
if there were no build options, even if there were only one
build setup and no configure step, then the different
announced versions of the software package might
have semantic changes that will not introduce any
syntax changes, any changes in function signatures, etc.

As You probably already guessed, those 2 issues
almost guarantee that the classical package collection
systems of all major 2016 Linux distributions is hopelessly
flawed and there is practically no chance to get a package
collection, including the set of various system wide libraries,
to a state, where it all actually works without flaws.
The countermeasure is to avoid depending on system
libraries or to build a new packaging system that treats
build option combinations as sub-versions and where all
dependencies are described by referencing a specific
version of the dependency, including the build options based
sub-version of the dependency.

Unufortunately the Nix Packaging System and the GuiX

https://nixos.org/nix/
https://www.gnu.org/software/guix/

are both fundamentally flawed from that respect.

---citation--end----

In my view there's no point of having a highly optimized
programming language, when it is used wastefully and the
wasting is exponential. To bring You an extreme example
and, I admit, to show off a little,
here's an example
how I outperform typical C++ with Ruby, PHP, JavaScript
at such a banal and "low level" task as string concatenation.
Not just a few percent, but, depending on data, at least 50%.
The C# implementation is under the MIT license.
One might just imagine the global savings on electricity,
if all the HTML, XML, JSON, SVG, etc. were assembled
by using that, more optimal, string concatenation function. :-D
The string concatenation example is related to
high performance computing by the fact that
multiplication of whole numbers is very similar
to string concatenation. I demonstrate that with
Ruby code at one of my comments at
a Ruby related bug report.
The demo is in 2 parts:
the_code.rb and
run_demo.bash.

A programming language is an important tool for software developers.
So, I, like everybody, have given it a lot of thought. My thoughts
are probably much dumber than
those of the giants, but
here's an excerpt from 2016_08 version of my notes:

---citation--start---

If the social aspects and domain knowledge related aspects 
have been taken to account, then the rest of 
the recommended programming language evaluation criteria 
might be:

Does the set of design patterns that a programming language 
has built-in support for cover 
the problem domain of a software project?

    (For example, before object oriented programming languages 
    were available, Object Orientation was used as 
    a design pattern. Another example of a design pattern that 
    has been incorporated into programming languages is 
    the Functional Programming.)

Do the programming language implementations 
have the tooling for detecting 
the programming language specific (abstraction layer specific) 
flaws and what the social process for developing/maintaining 
the tooling and price of the tooling is, 
including the dependencies of the tooling?

    (Whether a compiler detects possible flaws, 
    whether the programming languages has 
    add-on analysis tools like the C/C++/Ada have 
    in the form of "Formal Methods", 
    how and where model checking and/or 
    other means for detecting/preventing flaws are used.)

What performance optimizations does the programming language 
implementation have, including the various overheads of 
the possible runtimes like the C# CLR, JavaVM, 
Ruby/Python interpreter, etc?

    (Performance optimizations include 
    the classical topic of algorithmic complexity, 
    parallelization, code transformation based optimizations 
    that are carried out by compilers/interpreters/translators, 
    data flow analysis, RAM allocation size, 
    RAM usage locality, etc.)

---citation--end----

What regards to operating systems, then as of 2016_08
it seems that the most promising efforts seem to be the

GenodeOS, which is ready for betatesting,
and the
EU Commission financed MINIX3.
On the other hand, one of the lessons from the Debian
package collection is that a package collection and
packaging system can be independent of the operating
system. For example, the Debian package collection
is being used for totally non-Linux-non-BSD operating system,
the GNU Hurd.
I've tried the Hurd in VirtualBox and it does seem to
meet its advertisements. So, in stead of creating a
new operating system, may be a gradual development
approach, where an existing bleeding edge, MODULAR,
operating system, for example, the GenodeOS, is
accompanied by more modern versions of
operating system independent modules, in this case,
a proper packaging system, seems more doable,
realistic and yield more reusable and reliable results.

The beauty of the more gradual and modular approach
is that if the architecture fails, some new knowledge
shows that the whole thing is built wrongly, then a lot
of the previous work can be re-used, existing modules
can be changed individually, in parallel by different teams,
some re-written from scratch, and then re-assembled to
form the new system. The examples of operating
system component re-use are the file systems
and network stacks: new operating systems
do not need to implement their own file systems.
(Probably only psychologists can explain,
why the people at Microsoft implemented their
own, not-so-good, file systems in stead of just
using some proper open source file system.
The Microsoft filesystem access rights model might
have been mapped to the POSIX model, if they
really wanted to.)

@svick
Contributor
svick commented Aug 14, 2016

@martinvahi One might just imagine the global savings on electricity, if instead of complicated custom string concatenation algorithms, people used StringBuilder, or, even better, string.Concat(). 😉

@martinvahi
martinvahi commented Aug 14, 2016 edited

@svick Generally speaking, in the case of shorter strings I agree with Your last statement about the StringBuilder style of implementations, but please run the tests for longer strings that are assembled from a huge amount of small strings. The "StringBuilders" do need to allocate memory and that's where the difference comes from. The proposed watershed string concatenation algorithm gets its speed increase by reducing the amount of memory that is allocated for temporary strings. Big strings fall out of CPU cache, so the more CPU gets its work done by working with small strings that fit into the CPU cache, the quicker it gets its job done. I also understand that from electricity consumption point of view the CPU-s might go to energy saving mode, when waiting for the RAM to answer, but I do not know, if the CPU-s have such a fine grained energy consumption control.

What I do know is that any transfer of information by using electrical signals at non-superconducting wires has energy costs that come from the current that is used for charging/discharging the electrical capacitance of the bus. The longer the line, (CPU-2-RAM versus CPUcache-2-CPUregisters), the greater the capacitance.

A bit out of topic comment from me, but as it turns out, superconductor based logic circuits are not a novel idea. The field transistors seem to work by using electrical fields anyways, that is to say, those do not depend on the fact that the wires generate heat and have electrical resistance. If heat is never generated, which is the case when everything is superconductive, then it is OK for the superconductor based logic circuit to use huge currents. It's also somewhat OK for the circuits to be "slow" by the 2016 standards, because if they do not generate any heat, then they do not need any radiators or heat dissipation, which means that it is possible to stack a lot of them in a relatively small space and compensate their slowness with massive parallelism. Hospitals use MRI-machines that keep their superconductors cool. Therefore the technology to keep superconductors super-conducting cheaply enough for a server farm is already commercially available. Hospitals do not use any special, huge, air-conditioners at their MRI-scanner rooms or if they do, then it's only to compensate for the heat that leaks through the walls of the superconductor containment vessel. For a server those leaks could be minimized by making the vessel a sphere. At first glance, all what is needed to create a superconductor based server is to figure out, how to create miniature field transistors from superconductors. I do not know the details, but to me it looks pretty doable to any small lab that has the equipment for producing even year 2005 era prototypes, id est some outdated microchip prototyping equipment and chemistry related experimentation is probably all, what it takes to create such a server. The liquid helium (nitrogen is cheaper) is standard stuff that is available or at least orderable/creatable at any proper physics lab. The first technical obstacle that comes to my mind is that the huge varying magnetic fields within the circuit might introduce currents, interference, but that might be dealt with by designing special shielding into the microchip. It's just a wild guess.

Thank You for reading my comment.

@andre-ss6
andre-ss6 commented Aug 27, 2016 edited

Having seen the features that made the final cut onto C# 7, I think what we now need is a very performance-focused update. C# is already a very elegant and very pleasant to work language. That's why we use it instead of C++ (or even Java), even since the Windows-only days. Now that C# is open-source and runs on multiple-platforms, I see absolutely no reason to leave it but performance. And every single time I go back to C or especially C++ to work on some performance-critical code, I remember how much I love C#.

And I correct myself: earlier in this discussion, I said that compiler optimizations were more important and should come before anything else. I now disagree with my former self (I think that, in that time, I was working on a specific problem which exacerbated some of the lack-of optimizations in the compilers).

Looking now into the problem in a more holistic way, I ended up agreeing very much with a lot of what @xoofx has said. I think that, although we would benefit from a lot of changes in a lot of different places, the best we can do right now, the one thing we should focus on first, is to provide ways to make C# allocate less [on the heap].

Look at ASP.NET Core (there's a wonderful wonderful talk on this by the asp.net guys: https://vimeo.com/172009499). They've had to go to unspoken lengths in order to reduce the number of allocations on Kestrel so they could compete with other web frameworks in terms of performance. They've written some code that is [arguably 😛 ] as hard to decipher as a cryptographic text (no offense meant asp guys); it was literally written to be easy for machines to read, not humans. Inside .NET Core, everywhere you look, the solution to performance-related issues at least includes (when it is not the solution itself entirely) writing [crazy] code that reduces allocations. One dev commented somewhere here that they're trying to come up with a new API or something that allocates less in order to improve performance for Linq (and other things). When will this stop? Will we have to re-write the entire BCL [in that cryptographic idiom] in order to make up for the mistakes of old ( = writing code as if allocations were magically free)?

Coming up with a RC mechanism will not solve this. It may, at best, alleviate it slightly. Even other suggested solutions fall under the same question: if the GC is the problem, why not make it so we have to worry less about it, instead of more? Escape analysis is a good example of something that would bring the best of both worlds: by allocating on the stack, we not only (1) allocate faster; and (2) puts zero pressure on the GC, but we also keep things working as they currently do.

There's also the suggestion from @xoofx: a renewed stackallock operator in conjunction with a new transient one. I find his investigations very enlightening. Something like this would potentially reduce GC pressure by a reasonably good amount while also improving data locality.

Sebastian Sylvan talks about this in his blog. One of the things he mentions is a uniformly slow program, and that has happened very recently to me. I tried optmizing a simulator I was working on and after a few changes (first an algorithmic one, then reducing some allocations, remove some Linq code, stop using events) I ended up with a report showing no hotspots, although the program still had a lot of room for optimizations. The problem was that everything allocated. If I were to optimize the code further by a perceptible amount, I would have to rewrite a lot of things and use that allocation-free wizardry.

There have been some really great suggestion here and elsewhere to improve C# performance. However, I think it is now time to make a really performance-focused update. We're approaching a place where making performance improvements to C# as a bonus amidst several other updates will potentially make the language worse. Just take a look at this thread and see how many suggestions were made to improve performance by introducing changes that make it either easier or possible to work with code trickeries or that improve performance only in very specific cases. I don't think we should be suggesting band-aids; No, I think we should take a holistic view on C# current performance problems and try to come up with durable, well-thought solutions. The first thing I think should change in everyone's mindset, especially MS devs (if it hasn't already; also, no offense meant in any way) is the thought that allocations aren't obscenely costly [in the context of high-perf code] - they are! As Sebastian correctly noted, C# endorses allocations. It desperately wants you to allocate. And the .NET devs seems to have historically embraced that. And I don't blame them: who wants to fight the language? Compare this with C, where the language itself doesn't even support heap allocation, and you see the difference. We need to see the GC and RAM as very bad guys and we need to focus on building solutions to overcome them instead of focusing on improving our underground escape-route tunnels that we use to hide from them. Whomever wants to keep their trickery should be allowed and rewarded for doing so; that should not, however, be the only way to write high-performance code on C#.

Now, all this sounds very good and happy. Most of the solutions that come from this way of thinking, though, are generally met with a lot of friction because they introduce language breaking changes and/or CLR changes. I think that is a pity.

Also please don't misread my observations as "drop every other suggestion this is the exclusively important one". Even though, as I said, I disagree with my former self on whether compiler optimizations should come first, I still think that they should come sometime. There are tons of good suggestions out there. My post is not specifically about them in any way.

Sorry for the long post and thanks for reading.

TL;DR: I think C# 8 should have a very big focus on performance-improvement, creating solid solutions instead of band-aids and that we should be willing to potentially make changes wherever needed, even breaking ones .

@benaadams
benaadams commented Aug 28, 2016 edited

C# heap allocations aren't expensive and are faster than C/C++. They are even faster than stack allocations (if the stack alloc is too large to be a register) as the stack needs to zero out the memory first; whereas the heap has done this ahead of time. However, they do come at a cost, which is deallocation.

C/C++ have both an allocation and deallocation cost. The main advantage is you know when you are paying that deallocation cost; the main disadvantage is you often do it at the wrong time or don't do it at all which leads to memory leaks, using unallocated memory or worse trashing something else's memory (as it has been reallocated).

The GC isn't the bad guy; its the liver and blaming it is like blaming a hangover on the liver.

However; the free drinks the clr allocation bar gives you are very tempting, and does mean there is a tendency to over indulge so the hangovers can be quite bad... (GC pauses).

In lower throughput code this isn't too much of an issue, maybe some social drinking; but in high throughput code this is like a tequila shots competition on a bachelor party and one that happens every day! The GC does a great job; however we have a tendency to make it do a huge amount of work.

One way to resolve this; without requiring people to change how they code to avoid allocations would be escape analysis and stack allocation where the objects don't escape the call chain (swapping some drinks for water).

It still wouldn't resolve everything, like medium lifetime objects. However with less being put on the heap, this would mean there would be less GCs; which would mean less object promotion in the GC generations so you could get away with more. However some things would still need to be object pooled - but there would be less need for it.

If in function the stack allocation could be scoped then it could also be reused (like allocating in a for loop); then that would play nice with the cpu cache; as you'd keep using the same glass rather than getting a fresh one.

Hmm... I may have taken the drinks metaphor too far... 😸

@andre-ss6

Yea, I am aware that allocations on C# heap are faster than, say, a malloc. However, by cost of allocation I meant exactly the cost of having to GC them later and also the cost of cache-misses. I also agree with you that the GC itself is not a villain, but I don't think that drinks analogy is a good one :p. One drinks as much as he wants and he doesn't necessarily needs to. On the other hand, C# seems to be designed around the idea that "the GC can handle everything for us, so why even bother?", forcing (being very careful on using this word haha) us to allocate.

I find suggestions like Slices, stack allocation of classes and deterministic destruction very interesting and I think that they serve as evidence of all that I said.

All this is coming from a mostly empirical and somewhat anecdotal pov though. I'm no expert in GC or CLR's optimizations, not even by far. So I may be totally wrong of course. :P

@DemiMarie
DemiMarie commented Aug 29, 2016 edited

The following might be good ways to reduce allocation:

  • Interior references. That means references to the interior of objects. This requires special support in the CLR (specifically the GC).
  • Change the C# code to take generic arguments instead of interface arguments:
void ShouldNotAllocate<T, U>(T enumerator) where T: IEnumerable<U> {
    foreach (var i in enumerator) {
        DoSomething(i);
    }
}

Unfortunately, when I tried this C# was unable to infer the generic parameters in many cases.

  • Ref delegates (stack allocated, not heap allocated, but with a lifetime limited to the current stack frame).
  • Thread-local GC in the young generation.
  • Escape analysis.
  • Indirect branch removal by inlining delegates passed as arguments, thereby creating specialized versions of the function.
  • Whole-program optimizing compiler, operating on CIL and producing LLVM. Use optimizations such as 0CFA to identify actual targets of call sites. Perform region inference. Identify which data cannot escape. Identify data that only escapes along some code paths, and set up the data so that it only is allocated along those code paths.
  • Escape analysis: See above, but without necessarily having the entire program available.
  • Provide struct versions of Task<T>.
  • Make ValueType.Equals and ValueType.HashCode JIT intrinsics that are automatically generated per-type.
@Joe4evr
Joe4evr commented Aug 29, 2016
  • Generics that use type instantiation (C++/Rust/D) rather than type erasure (Nim)

What? This isn't Java. .NET generics are still strongly typed at runtime (which is why the CLR had to be updated to support them back in the day). They may not be pre-compiled like C++ templates, but here's a good read about the crazy hoops that the JIT does to optimize generics already.

  • Provide struct versions of Task<T>.

Already being worked on.

  • Make ValueType.Equals and ValueType.HashCode JIT intrinsics that are automatically generated per-type.

I think you meant GetHashCode() there, and I was told that the compiler already does that one for value types (though I'm not certain).

@benaadams
benaadams commented Aug 29, 2016 edited

Interior references. That means references to the interior of objects. This requires special support in the CLR (specifically the GC).

I think this is ref returns? which is coming

Indirect branch removal by inlining delegates passed as arguments, thereby creating specialized versions of the function.

Does it already for structs (not classes which use shared, though strongly typed code) see this comment in Vectors for patterns

Provide struct versions of Task

Already exists as ValueTask which can be awaited; and returning from a function that uses async rather than a split path function is coming in arbitrary async returns

@jcdickinson

@DemiMarie check out slice for interior references under the existing CLR. Ref returns use the same functionality AFAIK.

While arbitrary async returns are awesome, it's still only 50% of the solution - you need arbitrary async parameters, too. Either IAsyncTask, IAsyncTask<T> need to be added, or generic constraints need to be extended to support conventions-based interfaces (e.g. where T: async<R>).

@martinvahi

@orthoxerox

There are languages like Rust that try to maximize correctness without run-time overhead, so it's not one vs the other.

May be it helps, if I mention that according to Андрей Николаевич Терехов video series at YouTube he and his colleagues developed a Cold War era computer for the then-brand-new Soviet nuclear missiles and, supposedly, they got their inspiration for the missile computer from a project, where they had to build a computer for their university. The story is related to the current C# thread by the following series of observations/statements/claims:

The military grade then-supercomputer that the Soviet authorities allocated for his university after about 10 years of lobbying was unusable, because if the air was too dry, then there were electrical discards and if the air was too humid, the cooling system condensed the water, which in turn created short-circuits. Supposedly at that time that was the best military grade computer that the Soviet computer industry had to offer. The military kept those running by using a lot of slave labor in the form of conscripts, but his university did not have that kind of resources. His only option to get a computer for his university was to build his own.

To minimize the cooling power of the required cooling system, which condensed water, the electrical power consumption of the computer had to be reduced.

To minimize power consumption, NOT NECESSARILY THE COST OF THE COMPUTER, just power consumption, the number of electrical components within the computer had to be minimized.

To minimize the number of electrical components within the computer the architecture of the computer had to be optimized. One of the optimizations was to eliminate as many hardware based checks as possible and use formal methods at the compiler to guarantee that the conditions that the hardware checked for, never occur. (The way I understand, all hardware based exceptions, "CPU traps", were eliminated.)

One of the hardware blocks that was thrown out was memory access rights enforcement. Supposedly that was an intentionally chosen compromise that was made with full knowledge that the compromise solution eliminates reliable support for the C programming language. The compiler implemented Ada and Ada was supported without any compromise.

One of my personal takeaways from that story is that by putting as much as possible to the compiler side, greater number of parallel CPU-s with the same amount of electrical components can be created and all of that without increasing power consumption. The electrical components that consume power would be doing useful work, calculations, in stead of performing checks for conditions that the compiler can guarantee never to happen. Add to that the concept of systolic processors, also know as "systolic arrays", and whoever comes up with a programming language that is good at that kind of hardware architectures, wins the hearts of scientists, game engine developers, embedded systems developers, signal processing software developers.

The business software people, specially the various Microsoft divisions, have demonstrated for decades that they do not really care about efficiency, which means that the new type of hardware can just run a, comparatively speaking, slow virtual machine for whatever fancy programming language that the business software developers prefer. If that sounds harsh, then I add that I love Ruby, I do most of my code in Ruby or other interpreted languages (JavaScript, PHP, even Bash). I've done speed optimized C++ for image processing for years, so I choose, where I use what and with that speed optimized C++ background I say/write here that for many things the roughly 1000 (thousand) clocks per single Ruby basic command is plenty for me, even on old models of 32bit Raspberry Pi.

Another note about the C versus Ada example is that may be the C# should be "updated" by choosing a narrow sub-set of it that can be compiled to run on the limited, extremely parallel, systolic CPU-s, and the rest of the C# might be used for general programming, where software developer comfort is the primary goal, no matter, how costly. The costly part might run on the slow virtual machine. A single C# application might use highly optimized libraries as sub-components. The highly optimized components would use only the strict sub-set of the C# language. That way the high performance C# application code can be easily glued to the application code that uses the comfort-first-style-of-Csharp components. The developers of either branch of the C# language can then optimize their branch without being constrained by the requirements of the other branch. That idea is not novel at all and is being used at JavaScript circles. If a JavaScript code has been written by adhering to certain rules, then certain types of optimizations are applied. JavaScript application developers have to be aware of those rules, if they want to have those optimizations from the JIT.

An Out-of-Topic side-note is that this planet is interestingly small. In a private letter to me the Андрей Николаевич Терехов stated that during the Soviet Union times he got his academic degree from Tallinn, my home town, Tallinn University of Technology, and that he was supervised by Enn Tõugu, the founder of Computer Engineering branch at the then Soviet Tallinn University of Technology. I admit my stupidity and say that it is exactly thanks to that letter of the Андрей Николаевич Терехов that I know, who Niklaus Wirth as a pioneering computer scientist is. I think that the reason, why the newest programming languages of the Niklaus Wirth failed, with the obvious exception of the "older" Pascal, is that Niklaus Wirth did not pay attention to the inter-operability of programs that have been written in different programming languages and he also created those programming languages at an era, where internet did not exist and marketing to early adopters, the brightest bunch of the people, who are interested in trying out bleeding edge academic results, was too expensive for an academic like him to afford. It's pretty hard to reach that audience even in 2016 and just imagine, what it would be like at the time, when Niklaus Wirth was most active at creating new programming languages.

Thank You for reading my comment.
I at least tried to be helpful. :-)

@HaloFour

@martinvahi

Rather than trying to take a general purpose programming language like C# and trying to shoe-horn it into a very constrained and specific space such as highly automatically concurrent programming, why don't you actually use one of the various languages designed for that job? It's not like any of those languages let you write whatever you want and it's magickally concurrent/faster. What those languages do is force you onto the rails where the program could be parallelized. But if the tasks aren't sufficiently independent you won't realize any benefits regardless of the language or compiler. The act of being concurrent adds a lot of overhead, regardless of the language, OS, hardware, etc.

Try F# with Akka if you want something CLR-based.

@SunnyWar

We have had for years "solvers", such as the Microsoft Solver Foundation that can find optimal or near optimal solutions to a large variety of problems. I hope something like this already exists in the compiler. If not, perhaps it should be. I bet a there are a large number of decisions the compiler makes can be expressed as a Model with Parameters, Decisions, Goals[{Minimize|Maximize}], and Constraints. I also bet the solutions a solver comes up will be as good or better than anything a human has hardcoded.

@martinvahi

@HaloFour

Rather than trying to take a general purpose programming language like C# and trying to shoe-horn it into a very constrained and specific space such as highly automatically concurrent programming, why don't you actually use one of the various languages designed for that job?

Thank You and the @SunnyWar and others for the answers.

My last proposal was that instead of doing the 2 parts, speed-optimized part and development-comfort optimized part, in 2 different programming languages, the 2 parts might be written both in C#, except that the development-comfort optimized part can use all of C# and the speed-optimized part is allowed to use only a sub-set of C#. The benefit of such a solution over the Ruby-gems-written-in-C is that the development-comfort optimized part will become even more comfortable due to the possibility to avoid writing all the glue that it takes to glue together components that have been written in different programming languages. The speed-optimized part will probably be error prone and laborious to write no matter, what programming language and automation is used. As a matter of fact, the solution is even modular in a sense that different speed optimizations might allow the use of different sub-sets of the C# language. Developers of speed-optimized parts might have to choose, which type of auto-optimizer they want to use and then write their speed-optimized part by using only that subset of C# that is supported by the optimizer that they want to use.

This also allows experimentation at the academic side and gradual migration of the academic results to practice.

@HaloFour
HaloFour commented Sep 1, 2016

@martinvahi

The fact that you conflate "speed-optimized" with parallelism highlights the fact that you don't understand the problem space. They're not the same thing and they're not written in the same manner. Such a "dialect" of "C#" would be so fundamentally different from C# that it simply wouldn't be C#. For reference, I suggest you look into Microsoft Research language Axum.

What you suggest simply doesn't make sense. The languages and tools to write parallel programs already exist, including on the CLR. Microsoft has no reason to fork off a dialect of C# just to hack it up into something unrecognizable from C#. You may feel free to give it a try, though.

@antiufo
antiufo commented Sep 15, 2016

Shameless plug: in case someone is interested, here's a tool I recently published, roslyn-linq-rewrite, that rewrites the syntax trees of LINQ expressions and turns them into allocation-free, procedural code, by setting a custom compilerName in project.json.

@playsomethingsaxman
playsomethingsaxman commented Sep 15, 2016 edited

It's gotten clearer and clearer to me that most of C#'s performance problems are really framework performance problems. And it is far easier for me to rewrite the remaining parts of the .NET Framework for my competitive advantage than to complain to website people that Stream async methods need buffer pointer overloads.

It's strange that many of the same ones who defend the most awkward points of C#'s "striving for correctness" usually see nothing wrong with using signed integers for 0-based absolute pointers. It feels like being on an MMO forum watching the debate between crafters and raiders.

@chrisaut

@antiufo this looks very very nice (disclaimer, I haven't actually tried it).

But I fear it will reach a very limited audience being maintained as a separate tool. Do you have any plans to propose/integrate all or parts of this into the "mainstream" compiler here?

Anything you learned from it that would suggests it cannot/should not be done in the general case?

@antiufo
antiufo commented Sep 16, 2016

@chrisaut based on previous discussions, there seem to be three options:

Escape analysis only (JIT)

  • Func<> and <>_DisplayClass allocations could be made on the stack
  • The price of virtual calls (GetEnumerator, Func<>.Invoke()) would still be paid, no inlining
  • Type checks for ICollection and other special cases would still be paid

LINQ calls are optimized by the JIT

An [Intrinsic] attribute could be applied to Enumerable.Where<>, Select<> and so on. When the jitter sees calls to these methods, it could produce optimized procedural code based on its knowledge of how Where, Select methods should behave. By adding [Intrinsic] to your Where-like method, you are guaranteeing that this is indeed the behavior you want (#2385).

  • The IL code is still going to contain things like new <>DisplayClass. The jitter would have to collect all of these details and determine how to output the correct procedural machine code (it cannot rely on the syntax trees that roslyn is instead aware of). This means that the implementation could be quite ugly/complicated.

LINQ calls are optimized by the compiler

When the roslyn compiler sees a chain of calls to methods marked with [Intrinsic] and well-known names, it could produce optimized procedural code on its own.

  • Old assemblies would not benefit from this, they have to be recompiled.
  • Changing the Where/Select implementation would require recompiling the assemblies.

While I see that this last point might seem a bit problematic, this is not the first time we sacrifice the freedom to retroactively change the implementation of some very-straightforward code for the sake of performance (like [NonVersionable] properties/methods).
While the current implementation of many LINQ methods is non-trivial and could change in the future, most of the complexity is due to optimizations, special cases, type checks, management of IEnumerators.

After all, it's not unreasonable to think that the procedural implementation of a Where+Count over a non-null array is always going to be

var count = 0;
for (var i = 0; i < arr.Length; i++) // or foreach
{
    if(condition...arr[i]...) count++;
}

The code produced by the compiler might change slightly across compiler versions, but this already happens with other language features.

Of course, there's no need to optimize all LINQ methods. Complex ones like GroupBy could still be compiled in the traditional way (closure+func+call).

@torgabor

@antiufo Just my 2 cents regarding the 3 possible solutions: I think the 3rd solution is the best, for the following reasons:

  • As you stated, solution 1 doesn't solve the problem completely, however with even more compiler machinery in the JIT(stack allocation, devirtualization, inlining),the constructs could be reduced to a soup of plain instructions the cost could be eliminated. However this requires a lot of development effort, and it's not clear to me that this is feasible.
  • Solution also seems far from trivial, and also a bit inelegant, due to how lambdas are implemented. In the following code(which is not LINQ but would be candidate for a similar optimization):
            int x = 2;
            Action lamda = () => x++;
            lamda();
            Console.WriteLine(x);

The compiler moves out x from the stack onto a display class, and Console.WriteLine gets passed the field of the display class as a parameter, etc.. The compiler would need to recognize these patterns and transform them back to stack allocation, and loops. So in this case the compiler would apply source transformations that actively work against us, that we have to undo in the JIT.

  • Solution 3' only disadvantage I think, is that it could change the semantics of the source (which is obviously should not be allowed) if the implementer is not careful.
@jcdickinson

Concerning @antiufo's scenario 1 (I really don't like [Intrinsic]), I had an attempt at trying to get the JITter to do the "right" thing with Where(), by implementing the whole lot using structs. The disassembly (4.6.1, x64, release) shows multiple call ops, indicating that the desired inlining is not occurring. Using an array instead of IEnumerable seems to improve things, but gets nowhere near the desired results (and still has multiple call ops in the foreach block).

@xen2
xen2 commented Sep 21, 2016 edited

Solution3, if implemented, would need lot of tedious manual work, and would work only on LINQ. It would feel strange that certain method optimizations are hardcoded in Roslyn, while a user wouldn't be able to do the same with his own code?
Of course, making it extensible is possible (compiler plugins), but that would be huge complexity, lot of chance of mistakes in each LINQ method optimizer, etc... so I would rather avoid this approach. Or maybe it should be quite generic, like "IL" optimizer/passes plugin that could do any kind of optimizations, like LLVM? (I would be happy with that, as we already have to do IL post processing to various extents in our game engine)

Solution1 seems much better to have in general (lot of code coudl benefit from it, and should be "easier" to prove and make safe in general), and it seems some effort is already started in that direction, cf https://twitter.com/xjoeduffyx/status/771023674694524932

@mattwarren

Joe Duffy has a great talk that covers (amongst other things) what would need to be done to optimise LINQ, escape analysis, stack allocation etc (slides are available if you don't want to watch the whole thing)

@msedi
msedi commented Sep 22, 2016 edited

I have a general question about for loops. I was trying to optimize my mathematical routines which are mostly done on arrays of different types. As discussed very very often here the problem is that I cannot have pointers to generics so I had to duplicate my functions for all primitive types. However I have accepted - or better resigned - on this topic since it seems it will never come.

Nevertheless I have also tried the same - also discussed here - with IL code which works fine for my solution, but also there it would be nice to have some IL inline assembler, like the old asm {} keyword in C++, also here I guess it will never come.

What does currently bother me is when looking how a for loop is "converted" into IL code. From my old assembler knowledge there was the LOOP keyword where a simple addition was done in the AX,BC with the CX as count register. In IL it seems that all loops are converted to IF...GOTO statements which I feel very umcomfortable with, since I think no jitter will ever recognize that an IF...GOTO statement can be converted to the LOOP construct in x86 architecture. I guess that doing the loops with IF...GOTO costs much more than the x86 LOOP. What does the jitter do to optimize loops?

I'm I right or wrong on this topic.?

@bbarry
bbarry commented Sep 22, 2016

@msedi by building all loops in IL roughly the same way the jitter can search for a common pattern to optimize. Indeed the core CLR (and I assume desktop as well) does identify a number of such possible loops. For example:

https://github.com/dotnet/coreclr/blob/393b0a8262e5e4f1fed27494af3aac8778616d4c/src/jit/optimizer.cpp#L1195

Try to find loops that have an iterator (i.e. for-like loops) "for (init; test; incr){ ... }"
We have the following restrictions:

  1. The loop condition must be a simple one i.e. only one JTRUE node
  2. There must be a loop iterator (a local var) that is incremented (decremented or lsh, rsh, mul) with a constant value
  3. The iterator is incremented exactly once
  4. The loop condition must use the iterator.
@svick
Contributor
svick commented Sep 22, 2016 edited

@msedi Apparently, LOOP has been slower than jump since 80486.

And finding loops is easy for the JIT, you just have to find a cycle in the control flow graph generated from the IL.

@msedi
msedi commented Sep 22, 2016

@bbarry && @svick : Thanks for the explanations. That helps.

@andre-ss6

Wonderful talk by Joe Duffy. I felt happy to hear that they're [apparently] tackling all those problems we're discussing here.

And geez, I was at least impressed to hear that some applications from Microsoft (!) are 60% of the time in GC. 60%!! My god.

@rstarkov
rstarkov commented Sep 22, 2016 edited

@andre-ss6 hits the nail on the head. Of course not all performance issues are due to allocations. But unlike most performance issues, which have sane solutions in C#, if you run into 99% time spent in GC then you're pretty much stuffed.

What are your options at this stage? In C# as it stands today, pretty much the only option is to use arrays of structs. But any time you need to refer to one of those structs, you either go unsafe and use pointers, or you write extremely unreadable code. Both options are bad. If C# had AST macros, the code to access such "references" could be vastly more readable without any performance penalty added by the abstraction.

One of the bigger improvements on code that's already well-optimized comes from abandoning all the nice and convenient features like List<T>, LINQ or the foreach loop. The fact that these are expensive in tight code is unfortunate, but what is worse is that there is no way to rewrite these in a way that's comparable in readability - and that's another thing AST macros could help with.

Obviously the AST macros feature would need to be designed very carefully and would require a major time investment. But if I had a vote on the subject of the one single thing that would make fast C# less of a pain, AST macros would get my vote.

P.S. I was replying to Andre's comment from almost a month ago. What are the chances he'd comment again minutes before me?!

@agocke
Contributor
agocke commented Sep 22, 2016

@rstarkov Hmm, I would object to calling a codebase that's using LINQ "well-optimized." That's basically saying, "I'm not allocating anything, except for all these allocations!" :)

@SunnyWar
SunnyWar commented Sep 23, 2016 edited

I'm happy to see the ValueTask. I hope they make it into the Dataflow blocks. I wrote a audio router a few years ago. After profiling I found it spent most of it's time in the GC cleaning up tasks....and there was nothing I could about it without completely throwing out the Dataflow pipeline (basically the guts of the whole thing).

@benaadams

What are your options at this stage? In C# as it stands today, pretty much the only option is to use arrays of structs. But any time you need to refer to one of those structs, you either go unsafe and use pointers, or you write extremely unreadable code. Both options are bad.

@rstarkov you use Ref returns and locals in C# 7 with Visual Studio “15” Preview 4, though alas you can't use it will .NET Core currently. However, it is coming and should address this particular issue.

@svick
Contributor
svick commented Sep 23, 2016

@SunnyWar ValueTask makes the most sense when a method sometimes completes synchronously and sometimes asynchronously (in which case you don't have to allocate in the synchronous case). Not sure if using it would solve some general issue in Dataflow.

Were your transforms synchronous or asynchronous? If they were asynchronous, then you probably can't avoid allocating Tasks. If they were synchronous, then I'm not sure why would Dataflow allocate lots of Tasks.

@agocke
Contributor
agocke commented Sep 23, 2016

@benaadams Technically the NuGet packages are available, but it'd probably require building the .NET CLI repo from source

@benaadams

@agocke compiling it is one thing and important for CI; but development work doesn't flow so well when the UI tooling doesn't understand it very well and highlights errors :-/

@agocke
Contributor
agocke commented Sep 23, 2016

@benaadams Duh, I totally forgot about the IDE :-p

@bbarry
bbarry commented Sep 23, 2016

@SunnyWar, @svick also if you aren't careful in dataflow you can wind up with many allocations related to closures, lambdas and function pointers even if they were synchronous (it seems pretty impossible to avoid at least some in any case; sometimes it might even be reasonable to hold on to references intentionally to lighten GC in particular places).

@jcdickinson

@rstarkov ArraySegment<T> (essentially a slice) and a few home-grown extension methods can help with that, although not perfect. Also, have a look at Duffy's Slices.

@timgoodman

@agocke The fact that a codebase which uses LINQ is not "well-optimized" is exactly the problem. There's no reason in principle why, at least in the more simple cases (which are probably the majority of cases), the compiler couldn't do stack allocation, in-lining, loop fusion, and so forth to produce fast, imperative code. Broadly speaking, isn't that (a big part of) why we have a compiler - so we can write expressive, maintainable code, and let the machine rewrite it as something ugly and fast?

Don't get me wrong, I'm not expecting the compiler to completely free me from having to optimize, but optimizing some of the most common uses of Linq2Objects seems like relatively low-hanging fruit that would benefit a huge number of C# devs.

@timgoodman

@mattwarren That Joe Duffy talk is amazing, thanks for sharing! To what degree is this work already in progress with the C# compiler, as opposed to just in experimental projects like Midori? In particular, the stuff he's talking about at around 23:00 seems a lot like what people here are asking for as far as LINQ optimizations. Is there an issue in this GitHub repo that tracks the progress on that?

@benaadams

@timgoodman there are things here and there dotnet/coreclr#6653

@timgoodman

@benaadams Thanks. I guess I'm not sure why this sort of thing would be under coreclr. The kinds of changes that Joe Duffy was describing seem like compiler optimizations - shouldn't they belong in roslyn or maybe llilc?

@timgoodman

Ah, never mind, I hadn't realized that the coreclr repo contains the JIT compiler. I guess that's where this sort of optimization would need to happen for it to apply to calls to System.Linq methods.

@ciplogic
ciplogic commented Jan 2, 2017

Great effort!

Sounds to me a bit silly to point it out, but I've noticed in the codebase only one register colorizer: LSRA (Linear Scan) one.

Is it possible to set at least for flags like AggresiveInline to a different register allocator? Maybe BackTracking (the LLVM new one) or a full register allocator?

@ciplogic
ciplogic commented Jan 2, 2017

It would be great to be minimal CHA or at least for sealed classes to be devirtualized or internal classes in assembly that are not overriden to be considered sealed. Use this information to devirtualize (more aggresively) methods.

Very often using ToString, and so on, cannot be safely devirtualized because there is the possiblity that the methods to be overriden. But in many assemblies private/internal classes are easier to be tracked if they are overriden, especially as assemblies do have the types and relations local.

This operation should increase by a bit the starting time, but it could be enabled into a "performance mode" tier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment