State / Direction of C# as a High-Performance Language #10378

ilexp · 2016-04-06T19:47:10Z

I've been following recent development of C# as a language and it seems that there is a strong focus on providing the means to write code more efficiently. This is definitely neat. But what about providing ways to write more efficient code?

For context, I'm using C# mostly for game development (as in "lowlevel / from scratch") which has a habit of gladly abandoning the usual ways of safe code design for that 0.1% of the bottleneck code in favor of maximum efficiency. Unfortunately, there are cases where C# gets in the way of that last bit of optimization.

Issues related to this:

Ref returns / ref locals: Issue Proposal: Ref Returns and Locals #118
Slicing: Issue Proposal: Slicing #120
Array views of blittable data types: CoreCLR Issue 1015
Generic API for unsafe read / write: CoreFx Issue 5474
Support for SSE4 Intrinsics: CoreFx Issue 2209
Efficient unmanaged memory operations: CoreCLR Issue 916
Vector shuffling SIMD operations: CoreFx Issue 1168
Handling overlapped explicit FieldOffsets in structs: Issue Definite assignment versus StructLayout(LayoutKind.Explicit) #10319 and CS0171 should be smarter when checking structs with LayoutKind.Explicit #7323
Extended JIT time constants: CoreCLR Issue 2591
Extended Compile time constants: Issue Compile time constant some reflection values #10972
Extended unsafe generics: Issues Make Getting sizeof(T) where T : struct Possible #3208 and Making Declaration of T* where T : struct Possible #3210
PrimitiveValueType and Generic Pointers: Issue Proposal: PrimitiveValueType and Generic Pointers #2209
Custom memory allocations: CoreCLR Issue 1235
There are probably more - please share them in a comment

Other sentiments regarding this:

The only way to improve / guarantee memory locality right now seems to be putting the data into an array of structs. There are scenarios where this is a bit impractical. Are there ways to handle this with classes?
Language support for object pooling / limited control over what exactly "new" does for a given class / reference type.
A way to instantiate a reference type "within scope", making cleaning it up more efficient by somehow providing the GC with the extra knowledge about an explicit / intended lifespan.
Please share your own in a comment

This is probably more of a broader discussion, but I guess my core question is: Is there a general roadmap regarding potential improvements for performance-focused code in C#?

JoshVarty · 2016-04-06T20:28:45Z

A way to instantiate a reference type "within scope", making cleaning it up more efficient by somehow providing the GC with the extra knowledge about an explicit / intended lifespan.

I believe this was also requested in #161

HaloFour · 2016-04-07T13:23:03Z

The only way to improve / guarantee memory locality right now seems to be putting the data into an array of structs. There are scenarios where this is a bit impractical. Are there ways to handle this with classes?

A way to instantiate a reference type "within scope", making cleaning it up more efficient by somehow providing the GC with the extra knowledge about an explicit / intended lifespan.

I don't believe that these issues can be solved without direct CLR support. The CLR limits reference types to the heap. Even C++/CLI is forced to abide by that restriction and the stack semantics syntax still allocates on the heap. The GC also provides no facility to directly target specific instances.

I wonder how much C# could make a struct feel like a class before it crosses into unsafe/unverifiable territory. C++/CLI "native" classes are CLR structs so you don't have to deal with allocation/GC but of course the IL it emits is quite nasty.

ilexp · 2016-04-08T05:11:03Z

I've added some more related issues to the above list, which hadn't been mentioned yet.

paulcscharf · 2016-04-08T07:53:31Z

I am in a very similar position to @ilexp, and generally interested in the performance of my code, and knowing how to write efficient code. So I'd second the importance of this discussion.

I also think the summary and points in the original post are quite good, and have nothing to add at the moment.

Small note on using structs sort of like classes (but keeping everything on the stack):
I believe we can 'just' pass our structures down as ref for this purpose?
Make sure you don't do anything that creates a copy, and it should look like a class...
Not sure if that work flow needs any additional support from the language.

About memory locality: I was under the impression that if I new two class-objects after each other, they will also be directly after each other in memory, and stay that way? May be an implementation detail, but it's better than nothing... That being said, I've had to move from lists of objects to arrays of structs for performance reasons as well (good example would be particle systems, or similar simulations that have many small and short lived objects). Just the overhead from resolving references and having to gc the objects eventually made my original solution unfeasible. I am not sure this can be 'fixed' in a managed language at all though...
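A minimal sketch of the two ideas above (passing structs by `ref`, and keeping many small objects in an array of structs). The `Particle` type and the `ParticleSim` helper are hypothetical names chosen for illustration, not anything from an existing engine:

```csharp
using System;

// Hypothetical particle type. Because it is a struct, an array of them is one
// contiguous block of memory with no per-element object header or reference.
public struct Particle
{
    public float X, Y;
    public float VelX, VelY;
    public float Life;
}

public static class ParticleSim
{
    // Taking the particle by ref mutates the caller's storage in place
    // instead of working on a copy.
    public static void Step(ref Particle p, float dt)
    {
        p.X += p.VelX * dt;
        p.Y += p.VelY * dt;
        p.Life -= dt;
    }

    public static void StepAll(Particle[] particles, float dt)
    {
        for (int i = 0; i < particles.Length; i++)
        {
            // 'ref particles[i]' passes the array slot itself: no copy, no allocation.
            Step(ref particles[i], dt);
        }
    }

    public static void Main()
    {
        var particles = new Particle[1000];
        particles[0].VelX = 2f;
        particles[0].Life = 1f;
        StepAll(particles, 0.5f);
        Console.WriteLine(particles[0].X); // prints 1
    }
}
```

The catch mentioned above still applies: assigning `particles[i]` to a plain local creates a copy, so changes made to that local never reach the array.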

Looking forward to seeing what others have to say on this topic!

mattwarren · 2016-04-13T15:16:39Z

The only way to improve / guarantee memory locality right now seems to be putting the data into an array of structs. There are scenarios where this is a bit impractical. Are there ways to handle this with classes?

There was a really nice prototype done by @xoofx showing the perf improvements of allowing stackalloc on reference types.

SunnyWar · 2016-04-13T16:21:05Z

The only way to improve / guarantee memory locality right now seems to be putting the data into an array of structs. There are scenarios where this is a bit impractical. Are there ways to handle this with classes?

Many years ago, Microsoft Research experimented with using some unused bits on each object as access counters. The researcher hacked the heap to reorganize the most-used objects so that they ended up on the same page. He showed with a sample XML parser that the C# code was faster than optimized C++. The talk he gave on it was called, "Making C# faster than C#". The researcher who developed the technique left MS and the research apparently died with him. He had a long list of other, similar improvements that he was planning on trying. None of them, I believe, saw daylight.

Perhaps this work should be resuscitated so that the promise (made in the beginning: remember how the JITer was going to ultra-optimize for your hardware??) can be realized.

Claytonious · 2016-04-14T02:27:27Z

We are in the crowded boat of using c# with Unity3d, which may finally be moving toward a newer CLR sometime soon, so this discussion is of great interest to us. Thanks for starting it.

The request to have at least some hinting to the GC, even if not direct control, is at the top of our list. As programmers, we are in a position to declaratively "help" the GC but have no opportunity to do so.

ygc369 · 2016-04-14T07:48:45Z

I have some ideas:
https://github.com/dotnet/coreclr/issues/555
https://github.com/dotnet/coreclr/issues/1784
https://github.com/dotnet/coreclr/issues/757
#2171
https://github.com/dotnet/coreclr/issues/1856

IanKemp · 2016-04-14T12:32:57Z

"game development... has a habit of gladly abandoning the usual ways of safe code design for that 0.1% of the bottleneck code in favor of maximum efficiency. Unfortunately, there are cases where C# gets in the way of that last bit of optimization."

C# gets in the way because that's what it was designed to do.

If you want to write code that disregards correctness in favour of performance, you should be writing that code in a language that doesn't enforce correctness (C/C++), not trying to make a correctness-enforcing language less so. Especially since scenarios where performance is preferable to correctness are an extremely tiny minority of C# use cases.

orthoxerox · 2016-04-14T13:40:33Z

@IanKemp that's a very narrow view of C#. There are languages like Rust that try to maximize correctness without run-time overhead, so it's not one vs the other. While C# is a garbage-collected language by design, with all the benefits and penalties that it brings, there's no reason why we cannot ask for performance-oriented improvements, like cache-friendly allocations of collections of reference types or deterministic deallocation, for example. Even LOB applications have performance bottlenecks, not just computer games or science-related scripts.

svick · 2016-04-14T14:47:07Z

@IanKemp Are you saying that unsafe does not exist? C# had that from the start and it's exactly for that small amount of code where you're willing to sacrifice safety for performance.

SunnyWar · 2016-04-14T16:13:09Z

Hey, people...try this: write a function that will result in no garbage collections....something with a bunch of math in it for example. Write the exact same code in C++. See which is faster. The C++ compiler will always generate as fast or faster code (usually faster). The Intel compiler is most often even faster...it has nothing to do with the language.

For example, I wrote a PCM audio mixer in C# and C++ and compiled it with the .Net, MS, and Intel compilers. The code in question had no GC, no boundary checks, no excuses.

C#: slowest
C++ Microsoft: fast
C++ Intel: super fast

In this example the Intel compiler recognized that the computation could be replaced by SSE2 instructions. The Microsoft compiler wasn't so smart, but it was smarter than the .Net compiler/JITer.

So I keep hearing talk about adding extensions to the language to help the GC do things more efficiently, but it seems to me ...the language isn't the problem. Even if those suggestions are taken we're still hamstrung by an intentionally slow code generating compiler/jitter. It's the compiler and the GC that should be doing a better job.

See: #4331 I'm really tired of the C++ guys saying, "we don't use it because it's too slow" when there is _very little reason_ for it to be slow.

BTW: I'm in the camp of people that doesn't care how long the JITer takes to do its job. Most of the world's code runs on servers...why isn't it optimized to do so?

msedi · 2016-04-14T21:15:15Z

I completely agree with all of the mentioned improvements. In my opinion they are absolutely mandatory. Using C# in high-performance applications is the right way. It would make code much easier to read if at least some of the suggested improvements were available. Currently we have to "leave" the language for C++ or C to do things that are not possible in C#, and I don't mean assembler instructions but very simple pointer operations on blittable data types or generics.

So, in order not to leave the language, I created unreadable code fragments just to avoid unmanaged code, because with unmanaged code I am dependent on x86 and x64.

ilexp · 2016-04-15T05:17:21Z

BTW: I'm in the camp of people that doesn't care how long the JITer takes to do its job. Most of the world's code runs on servers...why isn't it optimized to do so?

From a gamedev perspective, it would be neat if there was a way to tell the runtime to perform extended JIT optimization using framework API.

Let's say by default, there is only the regular, fast optimization, the application starts up quickly and all behaves as usual. Then I enter the loading screen, because I'll have to load levels and assets anyway - now would be an excellent time to tell the runtime to JIT optimize the heck out of everything, because the user is waiting anyway and expecting to do so. This could happen on a per-method, per-class or per-Assembly level. Maybe you don't need 90% of the code to be optimized that well, but that one method, class or Assembly should be.
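As a point of reference, the closest existing hook is probably `RuntimeHelpers.PrepareMethod`, which asks the runtime to compile a method ahead of its first call. It does not request a higher optimization level, which is what is being asked for here, so the sketch below only illustrates the "do the JIT work during the loading screen" half of the idea. `Prejit` and `HotPath` are hypothetical names:

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

public static class Prejit
{
    // Asks the runtime to compile every concrete method declared directly on
    // 'type' now, rather than lazily on first call. Returns how many were prepared.
    public static int PrepareType(Type type)
    {
        const BindingFlags flags =
            BindingFlags.Public | BindingFlags.NonPublic |
            BindingFlags.Instance | BindingFlags.Static |
            BindingFlags.DeclaredOnly;

        int prepared = 0;
        foreach (MethodInfo method in type.GetMethods(flags))
        {
            // Abstract methods have no body; open generics cannot be compiled
            // without concrete type arguments.
            if (method.IsAbstract || method.ContainsGenericParameters) continue;
            RuntimeHelpers.PrepareMethod(method.MethodHandle);
            prepared++;
        }
        return prepared;
    }
}

// Hypothetical performance-critical class used only for this demonstration.
public static class HotPath
{
    public static float Mix(float a, float b) => (a + b) * 0.5f;
}

public static class LoadingScreen
{
    public static void Main()
    {
        // Called while the user is waiting anyway, e.g. during level loading.
        int count = Prejit.PrepareType(typeof(HotPath));
        Console.WriteLine("Prepared " + count + " method(s)");
    }
}
```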

As far as server applications go, they could very well do the same in the initialization phase. Same for audio, image and video processing software. Extended JIT optimization could be a very powerful opt-in and on runtimes that do not support this, the API commands can still just fall back to not having any effect.

Maybe it would even be possible to somehow cache the super-optimized machine code somewhere, so it doesn't need to be re-done at the next startup unless modified or copied to a different machine. Maybe partial caches would be possible, so even if not all code is super-JITed yet, at least the parts that are will be available. Which would be a lot more convenient and portable than pre-compiling an Assembly to native machine code, simply because Assemblies can run anywhere and native machine code can not.

All that said, I think both allowing the JIT to do a better job and allowing developers to write more efficient code in the first place would be equally welcome. I don't think this should be an either / or decision.

xoofx · 2016-04-15T06:13:06Z

Having advocated for C# performance for many years, I completely agree that it would be great to see more investment in this area.

Most notably on the following 3 axes:

Allow switching on a JIT with better (but slower) code generation. There is high hope that this will be fulfilled by the ongoing work on LLILC, for both JIT and AOT scenarios. Note that many platforms (e.g. iOS, UWP/XboxOne, PS4) don't support JIT scenarios. But it will take time to achieve even performance parity with the current JIT, and there are some language/runtime constraints that could make full optimization difficult (GC statepoints, array/null/arithmetic safety checks, etc.)
Improve the language (sometimes with proper JIT/GC support) in ways that could help in this area. That includes things listed above like ref locals, array slices, string slices... and even builtin utf8 strings... Some hacks can be done by post-processing IL and have been abused in many projects, but it would be great to have these little things available without any IL voodoo.
Put a lot more emphasis on memory management, data locality and GC pressure
- Standard improvements like stack alloc for classes, embedded class instances, borrowed pointers
- Rethink our usage of the GC, though this is a bit more problematic, as I haven't seen many proven models in production (things like: explicit vs implicit management of GC regions to allocate known objects to a proper region of objects that would relate in terms of locality/longevity)

Unfortunately, there are also some breaking change scenarios that would require forking the language/runtime to correctly address some of the intrinsic weaknesses of the current language/runtime model (e.g. things that have been done for Midori, such as their Error Model or safe native code, etc.)

svick · 2016-04-15T13:07:00Z

@SunnyWar I think there's enough room to optimize both code generation for math and GC.

As to which one should have higher priority, keep in mind that it's relatively easy to workaround bad performance in math by PInvoking native code or using Vector<T>. Working around bad performance due to GC overhead tends to be much harder, I think.
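To illustrate the `Vector<T>` workaround on something like the audio mixer mentioned earlier in the thread, here is a minimal sketch. `AudioMix.Mix` is a hypothetical helper; `Vector<float>` comes from `System.Numerics`, and whether it maps to SSE/AVX or something else depends on the JIT and the hardware:

```csharp
using System;
using System.Numerics;

public static class AudioMix
{
    // Adds two sample buffers element-wise into dst.
    public static void Mix(float[] a, float[] b, float[] dst)
    {
        if (a.Length != b.Length || a.Length != dst.Length)
            throw new ArgumentException("Buffers must have equal length.");

        int width = Vector<float>.Count; // lanes per vector, decided at run time
        int i = 0;

        // Vectorized part: processes 'width' samples per iteration.
        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<float>(a, i);
            var vb = new Vector<float>(b, i);
            (va + vb).CopyTo(dst, i);
        }

        // Scalar tail for the samples that do not fill a whole vector.
        for (; i < a.Length; i++)
        {
            dst[i] = a[i] + b[i];
        }
    }

    public static void Main()
    {
        var a = new float[10];
        var b = new float[10];
        var dst = new float[10];
        for (int i = 0; i < a.Length; i++) { a[i] = i; b[i] = 10 * i; }

        Mix(a, b, dst);
        Console.WriteLine(dst[9]); // prints 99
    }
}
```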

And since you mention servers, a big part of their performance is things like "how long does it take to allocate a buffer", not "how long does it take to execute math-heavy code".

GSPP · 2016-04-15T13:09:06Z

I'm adding JIT tiering to the list of features I see as required to make C# a truly high performance language. It is one of the highest impact changes that can be done at the CLR level.

JIT tiering has impact on the C# language design (counter-intuitively). A strong second tier JIT can optimize away abstractions. This can cause C# features to become truly cost free.

For example, if escape analysis and stack allocation of ref types was consistently working the C# language could take a more liberal stance on allocations.

If devirtualization was working better (right now: not at all in RyuJIT) abstractions such as Enumerable.* queries could become cost free (identical performance to manually written loops).

I imagine records and pattern matching as features that tend to cause more allocations and more runtime type tests. These are very amenable to advanced optimizations.

OtherCrashOverride · 2016-04-15T15:47:39Z

Born out of a recent discussion with others, I think it's time to review the "unsafe" syntax. The discussion can be summarized as "Does 'unsafe' even matter anymore?" .Net is moving "out of the security business" with CoreCLR. In a game development scenario, most of the work involves pointers to blocks of data. It would help if there was less syntactic verbosity in using pointers directly.

Support for SSE4 Intrinsics: CoreFx Issue 2209

This is completely useless on the billions of ARM devices out there in the world.

With regard to the GC discussion, I do not think that further GC abuse/workarounds are the solution. Instead there needs to be a deterministic alloc/ctor/dtor/free pattern. Typically this is done with reference counting. Today's systems are multi-core, and today's programs are multi-threaded. "Stop the world" is a very expensive operation.

In conclusion, what is actually desired is the C# language and libraries but on top of a next-generation runtime better suited for the needs of "real-time" (deterministic) development such as games. That is currently beyond the scope of CoreCLR. However, with everything finally open source, it's now possible to gather a like-minded group to pursue research into it as a different project.

TimPosey2 · 2016-04-15T15:48:47Z

I'm doing a lot of high-perf / low latency work in C#. One thing that would be "the killer feature" for perf work is for them to get .NET Native fully working. I know it's close, but the recent community standups have said that it won't be part of v1.0 RTM and they're rethinking the usage for it. The VS C++ compiler is amazing at auto-vectorizing, dead code elimination, constant folding, etc. It just does this better than I can hand-optimize C# in its limited ways. I believe traditional JIT compiling (not just RyuJIT) just doesn't have enough time to do all of those optimizations at run-time. I would be in favor of giving up additional compile time, portability, and reflection in exchange for better runtime performance; and I suspect those that are contributing to this thread probably feel the same way. For those that don't, you still have RyuJIT.

Second, it would help if there were some tuning knobs available for the CLR itself.

GSPP · 2016-04-15T16:48:45Z

Adding a proposal for Heap objects with custom allocator and explicit delete. That way latency-sensitive code can take control of allocation and deallocation while integrating nicely with an otherwise safe managed application.

It's basically a nicer and more practical new/delete.

benaadams · 2016-04-15T16:56:08Z

@OtherCrashOverride @GSPP Destructible Types? #161

OtherCrashOverride · 2016-04-15T18:53:13Z

Ideally, we want to get rid of IDisposable entirely and directly call the dtor (finalizer) when the object is no longer in use (garbage). Without this, the GC still has to stop all threads of execution to trace object use and the dtor is always called on a different thread of execution.

This implies we need to add reference counting and modify the compiler to increment and decrement the count as appropriate such as when a variable is copied or goes out of scope. You could then, for example, hint that you would like to allocate an object on the stack and then have it automatically 'boxed' (promoted) to a heap value if its reference count is greater than zero when it goes out of scope. This would eliminate "escape analysis" requirements.

Of course, all this is speculation at this point. But the theoretical benefits warrant research and exploration in a separate project. I suspect there is much more to gain from redesigning the runtime than there is from adding more rules and complication to the language.

SunnyWar · 2016-04-15T19:10:37Z

@OtherCrashOverride I've also come to the conclusion that a reference counting solution is critical for solving a number of problems.

For example, some years ago I wrote a message passing service using an Actor model. The problem I ran into right away was that I was allocating millions of small objects (for messages coming in) and the GC pressure to clean up after they went out was horrid. I ended up wrapping them in a reference counting object to essentially cache them. It solved the problem BUT I was back to the old, ugly COM days of having to ensure every Actor behaved and did an AddRef/Release for every message it processed. It worked..but it was ugly and I still dream of a day I can have a CLR managed reference countable object with an overloadable OnRelease, so that I can put it back in the queue when the count==0 rather than let it be GC'd.
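A rough sketch of the manual workaround described above, as it can be written today: a message object that carries its own reference count and returns itself to a pool when the count reaches zero. `PooledMessage` and its members are hypothetical names, and every consumer still has to pair `AddRef` with `Release` by hand, which is exactly the ugliness being complained about:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class PooledMessage
{
    private static readonly ConcurrentBag<PooledMessage> Pool =
        new ConcurrentBag<PooledMessage>();

    private int refCount;
    public string Payload;

    private PooledMessage() { }

    public static int PooledCount => Pool.Count;

    // Reuses a pooled instance when one is available; allocates otherwise.
    // The caller receives the message with a reference count of 1.
    public static PooledMessage Rent(string payload)
    {
        if (!Pool.TryTake(out PooledMessage msg)) msg = new PooledMessage();
        msg.Payload = payload;
        msg.refCount = 1;
        return msg;
    }

    public void AddRef() => Interlocked.Increment(ref refCount);

    // Plays the role of the wished-for "OnRelease": when the last reference
    // is dropped, the object goes back to the pool instead of becoming garbage.
    public void Release()
    {
        if (Interlocked.Decrement(ref refCount) == 0)
        {
            Payload = null;
            Pool.Add(this);
        }
    }

    public static void Main()
    {
        PooledMessage first = Rent("hello");
        first.AddRef();   // a second actor takes a reference
        first.Release();  // first actor is done
        first.Release();  // second actor is done, message returns to the pool

        PooledMessage second = Rent("world");
        Console.WriteLine(ReferenceEquals(first, second)); // prints True
    }
}
```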

ilexp · 2016-04-15T19:11:22Z

Don't want to detail the rest of it in this general overview thread, just regarding this specific point of @OtherCrashOverride's posting:

[...] than there is from adding more rules and complication to the language.

As a general direction of design with regard to future "efficient code" additions, I think it would be a good thing to keep most or even all of them - both language features and specialized API - hidden away just enough so nobody can stumble upon them accidentally, following the overall "pit of success" rule if you will.

I would very much like to avoid a situation where improving 0.1% of performance critical code would lead to an overall increase in complexity and confusion for the 99.9% of regular code. Removing the safety belt in C# needs to be a conscious and (ideally) local decision, so as long as you don't screw up in that specific code area, it should be transparent to all the other code in your project, or other projects using your library.

svick · 2016-04-15T19:24:51Z

@OtherCrashOverride

You could then, for example, hint that you would like to allocate an object on the stack and then have it automatically 'boxed' (promoted) to a heap value if its reference count is greater than zero when it goes out of scope. This would eliminate "escape analysis" requirements.

That would require you to find and update all existing references to that object. While the GC already does that when compacting, I doubt doing it potentially at every method return would be efficient.

SunnyWar · 2016-04-15T19:37:20Z

Today, the system imposes limitations on us that are purely historical and in no way limit how things can be done in the future.

mattwarren · 2016-09-21T09:53:30Z

Joe Duffy has a great talk that covers (amongst other things) what would need to be done to optimise LINQ, escape analysis, stack allocation etc (slides are available if you don't want to watch the whole thing)

msedi · 2016-09-22T16:44:20Z

I have a general question about for loops. I was trying to optimize my mathematical routines which are mostly done on arrays of different types. As discussed very very often here the problem is that I cannot have pointers to generics so I had to duplicate my functions for all primitive types. However I have accepted - or better resigned - on this topic since it seems it will never come.

Nevertheless I have also tried the same - also discussed here - with IL code which works fine for my solution, but also there it would be nice to have some IL inline assembler, like the old asm {} keyword in C++, also here I guess it will never come.

What currently bothers me is how a for loop is "converted" into IL code. From my old assembler knowledge, there was the LOOP instruction, where a simple addition was done in AX and BX with CX as the count register. In IL it seems that all loops are converted to IF...GOTO statements, which I feel very uncomfortable with, since I think no jitter will ever recognize that an IF...GOTO statement can be converted to the LOOP construct on the x86 architecture. I guess that doing the loops with IF...GOTO costs much more than the x86 LOOP. What does the jitter do to optimize loops?

Am I right or wrong on this topic?

bbarry · 2016-09-22T18:54:16Z

@msedi by building all loops in IL roughly the same way the jitter can search for a common pattern to optimize. Indeed the core CLR (and I assume desktop as well) does identify a number of such possible loops. For example:

https://github.com/dotnet/coreclr/blob/393b0a8262e5e4f1fed27494af3aac8778616d4c/src/jit/optimizer.cpp#L1195

Try to find loops that have an iterator (i.e. for-like loops) "for (init; test; incr){ ... }"
We have the following restrictions:

The loop condition must be a simple one i.e. only one JTRUE node

There must be a loop iterator (a local var) that is incremented (decremented or lsh, rsh, mul) with a constant value

The iterator is incremented exactly once

The loop condition must use the iterator.
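For illustration, a plain counted loop like the following has the shape that comment describes: one local iterator, one simple condition that uses it, and exactly one increment by a constant. Whether a given JIT version actually applies a particular optimization to it is not something this sketch can promise; `LoopShapes` is a hypothetical name:

```csharp
using System;

public static class LoopShapes
{
    public static long Sum(int[] values)
    {
        long total = 0;
        // 'i' is the single iterator, 'i < values.Length' is the single test
        // that uses it, and 'i++' is the only place where it changes.
        for (int i = 0; i < values.Length; i++)
        {
            total += values[i];
        }
        return total;
    }

    public static void Main()
    {
        Console.WriteLine(Sum(new[] { 1, 2, 3, 4 })); // prints 10
    }
}
```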

svick · 2016-09-22T20:13:03Z

@msedi Apparently, LOOP has been slower than jump since 80486.

And finding loops is easy for the JIT, you just have to find a cycle in the control flow graph generated from the IL.

msedi · 2016-09-22T20:51:39Z

@bbarry && @svick : Thanks for the explanations. That helps.

andre-ss6 · 2016-09-22T23:07:31Z

Wonderful talk by Joe Duffy. I felt happy to hear that they're [apparently] tackling all those problems we're discussing here.

And geez, I was at least impressed to hear that some applications from Microsoft (!) are 60% of the time in GC. 60%!! My god.

rstarkov · 2016-09-22T23:12:14Z

@andre-ss6 hits the nail on the head. Of course not all performance issues are due to allocations. But unlike most performance issues, which have sane solutions in C#, if you run into 99% time spent in GC then you're pretty much stuffed.

What are your options at this stage? In C# as it stands today, pretty much the only option is to use arrays of structs. But any time you need to refer to one of those structs, you either go unsafe and use pointers, or you write extremely unreadable code. Both options are bad. If C# had AST macros, the code to access such "references" could be vastly more readable without any performance penalty added by the abstraction.

One of the bigger improvements on code that's already well-optimized comes from abandoning all the nice and convenient features like List<T>, LINQ or the foreach loop. The fact that these are expensive in tight code is unfortunate, but what is worse is that there is no way to rewrite these in a way that's comparable in readability - and that's another thing AST macros could help with.

Obviously the AST macros feature would need to be designed very carefully and would require a major time investment. But if I had a vote on the subject of the one single thing that would make fast C# less of a pain, AST macros would get my vote.

P.S. I was replying to Andre's comment from almost a month ago. What are the chances he'd comment again minutes before me?!

agocke · 2016-09-22T23:40:13Z

@rstarkov Hmm, I would object to calling a codebase that's using LINQ "well-optimized." That's basically saying, "I'm not allocating anything, except for all these allocations!" :)

SunnyWar · 2016-09-23T01:10:35Z

I'm happy to see the ValueTask. I hope it makes it into the Dataflow blocks. I wrote an audio router a few years ago. After profiling I found it spent most of its time in the GC cleaning up tasks....and there was nothing I could do about it without completely throwing out the Dataflow pipeline (basically the guts of the whole thing).

benaadams · 2016-09-23T01:22:16Z

What are your options at this stage? In C# as it stands today, pretty much the only option is to use arrays of structs. But any time you need to refer to one of those structs, you either go unsafe and use pointers, or you write extremely unreadable code. Both options are bad.

@rstarkov you can use ref returns and locals in C# 7 with Visual Studio “15” Preview 4, though alas you can't use them with .NET Core currently. However, support is coming and should address this particular issue.
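A minimal sketch of what that looks like for the "array of structs" case, using C# 7 syntax. `Enemy` and `EnemyStore` are hypothetical types for illustration:

```csharp
using System;

public struct Enemy
{
    public int Health;
    public float X;
}

public sealed class EnemyStore
{
    private readonly Enemy[] items = new Enemy[16];

    // Ref return: hands out a reference to the array slot itself, not a copy.
    public ref Enemy At(int index) => ref items[index];

    public int HealthAt(int index) => items[index].Health;
}

public static class RefReturnDemo
{
    public static void Main()
    {
        var store = new EnemyStore();

        // Ref local: 'e' is an alias for the struct stored inside the array.
        ref Enemy e = ref store.At(3);
        e.Health = 100;
        e.Health -= 25;

        Console.WriteLine(store.HealthAt(3)); // prints 75
    }
}
```

No pointers or `unsafe` blocks are involved, and the calling code reads much like it would with a class.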

svick · 2016-09-23T01:23:17Z

@SunnyWar ValueTask makes the most sense when a method sometimes completes synchronously and sometimes asynchronously (in which case you don't have to allocate in the synchronous case). Not sure if using it would solve some general issue in Dataflow.
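A small sketch of that sometimes-synchronous pattern. `CachedLoader` is a hypothetical class and `Task.Delay` stands in for real I/O; on older frameworks `ValueTask<T>` comes from the System.Threading.Tasks.Extensions package:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public sealed class CachedLoader
{
    private readonly Dictionary<string, int> cache = new Dictionary<string, int>();

    public ValueTask<int> GetLengthAsync(string key)
    {
        // Synchronous path: the result is wrapped in a struct, no Task is allocated.
        if (cache.TryGetValue(key, out int cached))
            return new ValueTask<int>(cached);

        // Asynchronous path: a Task is still allocated, as with a Task<int> API.
        return new ValueTask<int>(LoadAsync(key));
    }

    private async Task<int> LoadAsync(string key)
    {
        await Task.Delay(10); // stand-in for real I/O
        int value = key.Length;
        cache[key] = value;
        return value;
    }
}

public static class ValueTaskDemo
{
    public static void Main()
    {
        var loader = new CachedLoader();

        int first = loader.GetLengthAsync("hello").AsTask().Result; // goes async
        ValueTask<int> second = loader.GetLengthAsync("hello");     // served from cache

        Console.WriteLine(first);              // prints 5
        Console.WriteLine(second.IsCompleted); // prints True
    }
}
```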

Were your transforms synchronous or asynchronous? If they were asynchronous, then you probably can't avoid allocating Tasks. If they were synchronous, then I'm not sure why would Dataflow allocate lots of Tasks.

agocke · 2016-09-23T02:01:45Z

@benaadams Technically the NuGet packages are available, but it'd probably require building the .NET CLI repo from source

benaadams · 2016-09-23T02:23:41Z

@agocke compiling it is one thing and important for CI; but development work doesn't flow so well when the UI tooling doesn't understand it very well and highlights errors :-/

agocke · 2016-09-23T08:02:25Z

@benaadams Duh, I totally forgot about the IDE :-p

bbarry · 2016-09-23T23:10:21Z

@SunnyWar, @svick also if you aren't careful in dataflow you can wind up with many allocations related to closures, lambdas and function pointers even if they were synchronous (it seems pretty impossible to avoid at least some in any case; sometimes it might even be reasonable to hold on to references intentionally to lighten GC in particular places).

jcdickinson · 2016-09-26T12:57:25Z

@rstarkov ArraySegment<T> (essentially a slice) and a few home-grown extension methods can help with that, although not perfect. Also, have a look at Duffy's Slices.
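As a sketch of the "home-grown extension methods" idea: `ArraySegment<T>` itself is part of the BCL, while `Segment`, `GetAt` and `SetAt` below are hypothetical helpers written for this example:

```csharp
using System;

public static class SegmentExtensions
{
    // Creates a view over part of an array without copying any elements.
    public static ArraySegment<T> Segment<T>(this T[] array, int offset, int count)
        => new ArraySegment<T>(array, offset, count);

    // Index relative to the start of the segment, with a bounds check.
    public static T GetAt<T>(this ArraySegment<T> segment, int index)
    {
        if ((uint)index >= (uint)segment.Count)
            throw new ArgumentOutOfRangeException(nameof(index));
        return segment.Array[segment.Offset + index];
    }

    public static void SetAt<T>(this ArraySegment<T> segment, int index, T value)
    {
        if ((uint)index >= (uint)segment.Count)
            throw new ArgumentOutOfRangeException(nameof(index));
        segment.Array[segment.Offset + index] = value;
    }
}

public static class SegmentDemo
{
    public static void Main()
    {
        int[] samples = { 10, 20, 30, 40, 50 };
        ArraySegment<int> middle = samples.Segment(1, 3); // views 20, 30, 40

        middle.SetAt(0, 99); // writes through to the underlying array
        Console.WriteLine(samples[1]);      // prints 99
        Console.WriteLine(middle.GetAt(2)); // prints 40
    }
}
```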

timgoodman · 2016-09-29T01:01:38Z

@agocke The fact that a codebase which uses LINQ is not "well-optimized" is exactly the problem. There's no reason in principle why, at least in the more simple cases (which are probably the majority of cases), the compiler couldn't do stack allocation, in-lining, loop fusion, and so forth to produce fast, imperative code. Broadly speaking, isn't that (a big part of) why we have a compiler - so we can write expressive, maintainable code, and let the machine rewrite it as something ugly and fast?

Don't get me wrong, I'm not expecting the compiler to completely free me from having to optimize, but optimizing some of the most common uses of Linq2Objects seems like relatively low-hanging fruit that would benefit a huge number of C# devs.
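To make the gap concrete, here is a simple Linq2Objects query next to the loop one would hope a compiler could produce from it. `SumOfEvenSquares` is a hypothetical example, not taken from any benchmark:

```csharp
using System;
using System.Linq;

public static class SumOfEvenSquares
{
    // Expressive version: allocates iterator objects and invokes two
    // delegates per element.
    public static int WithLinq(int[] xs) =>
        xs.Where(x => x % 2 == 0).Select(x => x * x).Sum();

    // Hand-fused version: one pass, no allocations, no delegate calls.
    // Note: Enumerable.Sum throws on overflow, this loop wraps around
    // (in a default unchecked context).
    public static int Fused(int[] xs)
    {
        int total = 0;
        for (int i = 0; i < xs.Length; i++)
        {
            int x = xs[i];
            if (x % 2 == 0) total += x * x;
        }
        return total;
    }

    public static void Main()
    {
        int[] data = { 1, 2, 3, 4 };
        Console.WriteLine(WithLinq(data)); // prints 20
        Console.WriteLine(Fused(data));    // prints 20
    }
}
```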

timgoodman · 2016-09-29T01:13:41Z

@mattwarren That Joe Duffy talk is amazing, thanks for sharing! To what degree is this work already in progress with the C# compiler, as opposed to just in experimental projects like Midori? In particular, the stuff he's talking about at around 23:00 seems a lot like what people here are asking for as far as LINQ optimizations. Is there an issue in this GitHub repo that tracks the progress on that?

benaadams · 2016-09-29T01:24:32Z

@timgoodman there are things here and there dotnet/coreclr#6653

timgoodman · 2016-09-29T04:01:20Z

@benaadams Thanks. I guess I'm not sure why this sort of thing would be under coreclr. The kinds of changes that Joe Duffy was describing seem like compiler optimizations - shouldn't they belong in roslyn or maybe llilc?

timgoodman · 2016-09-29T04:41:47Z

Ah, never mind, I hadn't realized that the coreclr repo contains the JIT compiler. I guess that's where this sort of optimization would need to happen for it to apply to calls to System.Linq methods.

ciplogic · 2017-01-02T11:58:15Z

Great effort!

It may sound a bit silly to point out, but I've noticed only one register allocator in the codebase: the LSRA (linear scan) one.

Is it possible, at least for methods flagged with something like AggressiveInlining, to switch to a different register allocator? Maybe backtracking (the new LLVM one) or a full register allocator?

ciplogic · 2017-01-02T12:13:14Z

It would be great to have at least minimal CHA (class hierarchy analysis), so that calls on sealed classes are devirtualized, and so that internal classes in an assembly that are never overridden are treated as sealed. This information could be used to devirtualize methods more aggressively.

Very often calls such as ToString cannot be safely devirtualized because there is the possibility that the methods are overridden. But in many assemblies it is easier to track whether private/internal classes are overridden, especially as an assembly has its types and their relations available locally.

This operation should increase by a bit the starting time, but it could be enabled into a "performance mode" tier.
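To make the distinction concrete, here is a minimal sketch of the cases described above. The types `Shape`, `Square`, `Circle` and the class `DevirtualizationDemo` are hypothetical names made up for illustration. Whether a particular JIT version actually turns these calls into direct calls is an assumption, not something this thread confirms.

```csharp
using System;

public abstract class Shape
{
    public abstract double Area();
}

// Explicitly sealed: no subclass can override Area().
public sealed class Square : Shape
{
    private readonly double _side;
    public Square(double side) { _side = side; }
    public override double Area() => _side * _side;
}

// Internal and (in this hypothetical assembly) never derived from, but not
// marked sealed. The suggestion above is to treat such a class as if it were.
internal class Circle : Shape
{
    private readonly double _radius;
    public Circle(double radius) { _radius = radius; }
    public override double Area() => Math.PI * _radius * _radius;
}

public static class DevirtualizationDemo
{
    // The static type is a sealed class, so the call target is known exactly.
    // This is the case where a direct (and possibly inlined) call is safe.
    public static double SumSquares(Square[] squares)
    {
        double total = 0;
        foreach (Square s in squares) total += s.Area();
        return total;
    }

    // The static type is the abstract base, so without further analysis
    // each call has to go through virtual dispatch.
    public static double SumShapes(Shape[] shapes)
    {
        double total = 0;
        foreach (Shape s in shapes) total += s.Area();
        return total;
    }
}
```

Both methods compute the same result; the only difference is how much the compiler can know about the call target from the static type.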

damageboy · 2017-03-05T09:43:56Z

Hi All,
I've recently taken some time to "Make C# Benchmark Great Again" by creating a new C# version of the k-nucleotide benchmark that beats the old version by a factor of 2 and should probably find itself in position #3 on that specific test.

I mostly did this because I couldn't see Java score better than C#, but my mental issues are not the subject of this issue.

The main improvement with this version (i.e. where most of the fat came off) is the use of a ref-return dictionary instead of the .NET Dictionary<TKey, TValue>.
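For readers unfamiliar with the idea, here is a minimal sketch of what a ref-return lookup can look like. It is not the code from the benchmark submission: `RefCounterMap`, `GetOrAddRef` and `RefCounterMapDemo` are hypothetical names, and the open-addressing layout with linear probing is an assumption chosen to keep the example short. It requires C# 7.0 or later.

```csharp
using System;

// Hypothetical counter map for illustration; not part of the BCL.
public sealed class RefCounterMap
{
    private long[] _keys;
    private int[] _values;
    private bool[] _used;
    private int _count;

    public RefCounterMap(int capacity = 16)
    {
        // Keep the table size a power of two so the hash can be masked.
        int size = 16;
        while (size < capacity * 2) size <<= 1;
        _keys = new long[size];
        _values = new int[size];
        _used = new bool[size];
    }

    public int Count => _count;

    // Returns a reference to the value slot for 'key', adding it (as 0) if
    // missing. The reference is only valid until the next call that grows
    // the table, because growing replaces the backing arrays.
    public ref int GetOrAddRef(long key)
    {
        if ((_count + 1) * 2 > _keys.Length) Grow();
        int index = FindSlot(_keys, _used, key);
        if (!_used[index])
        {
            _used[index] = true;
            _keys[index] = key;
            _count++;
        }
        return ref _values[index];
    }

    // Linear probing; terminates because the table is never more than half full.
    private static int FindSlot(long[] keys, bool[] used, long key)
    {
        int mask = keys.Length - 1;
        int i = key.GetHashCode() & mask;
        while (used[i] && keys[i] != key) i = (i + 1) & mask;
        return i;
    }

    private void Grow()
    {
        var keys = new long[_keys.Length * 2];
        var values = new int[keys.Length];
        var used = new bool[keys.Length];
        for (int i = 0; i < _keys.Length; i++)
        {
            if (!_used[i]) continue;
            int j = FindSlot(keys, used, _keys[i]);
            used[j] = true;
            keys[j] = _keys[i];
            values[j] = _values[i];
        }
        _keys = keys;
        _values = values;
        _used = used;
    }
}

public static class RefCounterMapDemo
{
    public static int CountOccurrences(long[] keys, long target)
    {
        var map = new RefCounterMap();
        // One lookup per increment: the returned ref is updated in place.
        foreach (long k in keys) map.GetOrAddRef(k)++;
        return map.GetOrAddRef(target);
    }
}
```

With Dictionary<TKey, TValue>, the same increment is typically written as a TryGetValue followed by an indexer assignment, which hashes and probes twice. The ref-return version does the lookup once and mutates the slot directly.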

Try as I might, I couldn't find a proper discussion of adding new data structures, or new functionality on existing data structures, that would bring ref-return APIs to System.Collections.Generic, and I find that somewhat bizarre...

Is anyone here aware of a discussion / decision regarding this?

It feels odd for the Roslyn team to ship this very nice new language feature while leaving the BCL data structures out of it, so I thought I should ask whether anyone here, whom I assume to be very knowledgeable about the high-performance situation of .NET / C#, could elaborate on where we currently stand.

jcouv · 2017-10-22T05:13:51Z

@ilexp C# 7.2 adds a number of performance-related features: ref readonly, readonly and ref structs (such as Span<T>). We're also considering "ref local reassignment" for a subsequent release.
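As a rough illustration of how these C# 7.2 features combine, here is a small sketch. `Vector3` and `SpanDemo` are hypothetical names for this example only, and it assumes a runtime or package that provides `ReadOnlySpan<T>` (for example .NET Core 2.1 or later, or the System.Memory package).

```csharp
using System;

// A readonly struct: the compiler knows no member can mutate it, so calling
// methods through 'in' or 'ref readonly' needs no defensive copy.
public readonly struct Vector3
{
    public readonly float X, Y, Z;

    public Vector3(float x, float y, float z) { X = x; Y = y; Z = z; }

    // 'in' passes the argument by reference without allowing modification.
    public float Dot(in Vector3 other) => X * other.X + Y * other.Y + Z * other.Z;
}

public static class SpanDemo
{
    // ReadOnlySpan<T> is a ref struct: a view over existing memory that
    // lives only on the stack and never allocates.
    public static float SumOfDots(ReadOnlySpan<Vector3> items, in Vector3 axis)
    {
        float total = 0;
        for (int i = 0; i < items.Length; i++)
        {
            // 'ref readonly' local: refers to the element in place, no copy.
            ref readonly Vector3 v = ref items[i];
            total += v.Dot(axis);
        }
        return total;
    }
}
```

A caller can pass a slice of an array without copying it, for example `SpanDemo.SumOfDots(data.AsSpan(1, 2), axis)` for a `Vector3[] data`, which covers part of the slicing request from the original post.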

I haven't seen much activity on this thread for a while. Consider closing if resolved.

ilexp · 2017-10-22T09:43:54Z

@jcouv Yep, have been excitedly watching the new developments in C# and they definitely address some of the points. Others still remain to be discussed or addressed, but the big unsafe / slice / span part is done and discussion has been diverted to the individual issues in CoreCLR and CSharpLang. Closing this, looking forward to future improvements.

Tst2468 · 2018-12-17T17:20:33Z

I worked on highly optimized code, including hand optimized assembly code, in the video game industry for 13 years.
The performance issue is simple: any language, to be fast enough, must produce highly optimized machine code. Period.
We had to turn off many features of C++, including RTC, avoid the STL or write our own versions of it, and write our own memory management for it to be fast enough.
We literally counted every CPU cycle.
We had to optimize beyond the capabilities of the compiler, in terms of both C++ and assembly to get the performance we needed.
Every piece of C++ code that had to perform was disassembled and either hand optimized based on the assembly output or re-coded in assembly by hand.
We had to take into account every aspect of CPU data caching, pipeline optimization, instruction interleaving, branch prediction, and memory management to get the necessary performance.
Any language that ends with native assembly code that can be optimized at that level is fine.
Unfortunately, C#, Java, and other VM-based languages were made for portability and safety and not for performance, so they make the job of optimizing much harder.
Well optimized C# code can beat poorly optimized C++ code, but C# cannot match or beat well optimized C++ or assembly code unless you can make it produce comparable machine code.
It's just physics.
Your best bet, if you want to use C# or Java for games is to use it for non-performance intensive parts of the code and call native libraries written in C/C++/Assembly for the high-performance parts.

gafter added the Language-C#, Discussion, Tenet-Performance, and Area-Language Design labels Apr 6, 2016

ilexp closed this as completed Oct 22, 2017

hypeartist mentioned this issue Jun 5, 2018

Which language features should be implemented in C# v8.0 dotnet/csharplang#1600

Closed

PathogenDavid mentioned this issue Jun 20, 2018

Allow suppression of definite assignment in a struct constructor with explicit layout. dotnet/csharplang#1465

Closed
