State / Direction of C# as a High-Performance Language #10378

Closed
ilexp opened this Issue Apr 6, 2016 · 166 comments

ilexp commented Apr 6, 2016

I've been following the recent development of C# as a language, and it seems there is a strong focus on providing the means to write code more efficiently. This is definitely neat. But what about providing ways to write more efficient code?

For context, I'm using C# mostly for game development (as in "low-level / from scratch"), which has a habit of gladly abandoning the usual ways of safe code design for that 0.1% of bottleneck code in favor of maximum efficiency. Unfortunately, there are cases where C# gets in the way of that last bit of optimization.

Issues related to this:

Other sentiments regarding this:

  • The only way to improve / guarantee memory locality right now seems to be putting the data into an array of structs. There are scenarios where this is a bit impractical. Are there ways to handle this with classes?
  • Language support for object pooling / limited control over what exactly "new" does for a given class / reference type.
  • A way to instantiate a reference type "within scope", making cleaning it up more efficient by somehow providing the GC with the extra knowledge about an explicit / intended lifespan.
  • Please share your own in a comment

This is probably more of a broader discussion, but I guess my core question is: Is there a general roadmap regarding potential improvements for performance-focused code in C#?
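To illustrate the first bullet above, a minimal sketch of the array-of-structs pattern the original post refers to. The `Particle` type and its fields are hypothetical, chosen only to show why an array of structs gives contiguous memory where a list of class instances would not:

```csharp
using System;

// Hypothetical particle data, laid out as a struct so that an array of
// them is one contiguous block of memory (good cache locality).
struct Particle
{
    public float X, Y;
    public float VelX, VelY;
}

static class ParticleDemo
{
    // A Particle[] of N elements is a single allocation traversed
    // sequentially, unlike N class instances scattered across the heap.
    public static void Update(Particle[] particles, float dt)
    {
        for (int i = 0; i < particles.Length; i++)
        {
            particles[i].X += particles[i].VelX * dt;
            particles[i].Y += particles[i].VelY * dt;
        }
    }

    public static void Main()
    {
        var particles = new Particle[1000];
        for (int i = 0; i < particles.Length; i++)
            particles[i] = new Particle { VelX = 1f, VelY = 2f };

        Update(particles, 0.5f);
        Console.WriteLine(particles[0].X); // 0.5
    }
}
```

The impracticality mentioned above comes in when the same data also needs identity, polymorphism, or references from elsewhere, which structs don't provide.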


Contributor

JoshVarty commented Apr 6, 2016

A way to instantiate a reference type "within scope", making cleaning it up more efficient by somehow providing the GC with the extra knowledge about an explicit / intended lifespan.

I believe this was also requested in #161


HaloFour commented Apr 7, 2016

  • The only way to improve / guarantee memory locality right now seems to be putting the data into an array of structs. There are scenarios where this is a bit impractical. Are there ways to handle this with classes?
  • A way to instantiate a reference type "within scope", making cleaning it up more efficient by somehow providing the GC with the extra knowledge about an explicit / intended lifespan.

I don't believe that these issues can be solved without direct CLR support. The CLR limits reference types to the heap. Even C++/CLI is forced to abide by that restriction and the stack semantics syntax still allocates on the heap. The GC also provides no facility to directly target specific instances.

I wonder how much C# could make a struct feel like a class before it crosses into unsafe/unverifiable territory. C++/CLI "native" classes are CLR structs so you don't have to deal with allocation/GC but of course the IL it emits is quite nasty.


ilexp commented Apr 8, 2016

I've added some more related issues to the above list, which hadn't been mentioned yet.


amulware commented Apr 8, 2016

I am in a very similar position to @ilexp, and generally interested in the performance of my code, and knowing how to write efficient code. So I'd second the importance of this discussion.

I also think the summary and points in the original post are quite good, and have nothing to add at the moment.

Small note on using structs sort of like classes (but keeping everything on the stack):
I believe we can 'just' pass our structures down as ref for this purpose.
Make sure you don't do anything that creates a copy, and it should behave like a class...
Not sure if that workflow needs any additional support from the language.

About memory locality: I was under the impression that if I new two class-objects after each other, they will also be directly after each other in memory, and stay that way? May be an implementation detail, but it's better than nothing... That being said, I've had to move from lists of objects to arrays of structs for performance reasons as well (good example would be particle systems, or similar simulations that have many small and short lived objects). Just the overhead from resolving references and having to gc the objects eventually made my original solution unfeasible. I am not sure this can be 'fixed' in a managed language at all though...

Looking forward to seeing what others have to say on this topic!
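The ref-passing idea above can be sketched as follows. The `Health` type and `Damage` method are illustrative only; the point is that a `ref` parameter avoids the copy and lets the callee mutate the caller's instance, so the struct behaves much like a class without a heap allocation:

```csharp
using System;

struct Health
{
    public int Current, Max;
}

static class RefDemo
{
    // Taking the struct by ref avoids the copy a by-value parameter would
    // make, and mutates the caller's instance directly.
    public static void Damage(ref Health h, int amount)
    {
        h.Current = Math.Max(0, h.Current - amount);
    }

    public static void Main()
    {
        var h = new Health { Current = 100, Max = 100 };
        Damage(ref h, 30);   // mutates the original, not a copy
        Console.WriteLine(h.Current); // 70
    }
}
```

The copy hazards amulware warns about remain: assigning the struct to a local, capturing it in a lambda, or boxing it to an interface all silently create copies.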


mattwarren commented Apr 13, 2016

The only way to improve / guarantee memory locality right now seems to be putting the data into an array of structs. There are scenarios where this is a bit impractical. Are there ways to handle this with classes?

There was a really nice prototype done by @xoofx showing the perf improvements of allowing stackalloc on reference types.
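For reference, `stackalloc` as it exists today is limited to value types and requires an `unsafe` context (compile with `/unsafe`); the prototype mentioned above extends the idea to reference types. A sketch of the existing form:

```csharp
using System;

static class StackAllocDemo
{
    // The buffer lives on the stack for the duration of the call and
    // costs the GC nothing; it is freed automatically on return.
    static unsafe int SumFirstN(int n)
    {
        int* buffer = stackalloc int[n];
        for (int i = 0; i < n; i++)
            buffer[i] = i + 1;

        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += buffer[i];
        return sum;
    }

    public static void Main()
    {
        Console.WriteLine(SumFirstN(10)); // 55
    }
}
```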


SunnyWar commented Apr 13, 2016

The only way to improve / guarantee memory locality right now seems to be putting the data into an array of structs. There are scenarios where this is a bit impractical. Are there ways to handle this with classes?

Microsoft Research experimented many years ago with using some unused bits on each object as access counters. The project hacked the heap to reorganize the most-used objects so that they ended up on the same page. The researcher showed that, in a sample XML parser, C# code was faster than optimized C++; the talk he gave on it was called "Making C# faster than C++". The researcher who developed the technique left MS, and the work apparently died with him. He had a long list of other, similar improvements he was planning to try, none of which, I believe, saw daylight.

Perhaps this work should be resuscitated so that the promise made in the beginning (remember how the JITer was going to ultra-optimize for your hardware?) can be realized.


Claytonious commented Apr 14, 2016

We are in the crowded boat of using C# with Unity3D, which may finally be moving toward a newer CLR sometime soon, so this discussion is of great interest to us. Thanks for starting it.

The request to have at least some hinting to the GC, even if not direct control, is at the top of our list. As programmers, we are in a position to declaratively "help" the GC but have no opportunity to do so.


IanKemp commented Apr 14, 2016

"game development... has a habit of gladly abandoning the usual ways of safe code design for that 0.1% of the bottleneck code in favor of maximum efficiency. Unfortunately, there are cases where C# gets in the way of that last bit of optimization."

C# gets in the way because that's what it was designed to do.

If you want to write code that disregards correctness in favour of performance, you should be writing that code in a language that doesn't enforce correctness (C/C++), not trying to make a correctness-enforcing language less so. Especially since scenarios where performance is preferable to correctness are an extremely tiny minority of C# use cases.


Contributor

orthoxerox commented Apr 14, 2016

@IanKemp that's a very narrow view of C#. There are languages like Rust that try to maximize correctness without run-time overhead, so it's not one vs the other. While C# is a garbage-collected language by design, with all the benefits and penalties that it brings, there's no reason why we cannot ask for performance-oriented improvements, like cache-friendly allocations of collections of reference types or deterministic deallocation, for example. Even LOB applications have performance bottlenecks, not just computer games or science-related scripts.


Contributor

svick commented Apr 14, 2016

@IanKemp Are you saying that unsafe does not exist? C# has had that from the start, and it's exactly for that small amount of code where you're willing to sacrifice safety for performance.


SunnyWar commented Apr 14, 2016

Hey, people... try this: write a function that results in no garbage collections (something with a bunch of math in it, for example). Write the exact same code in C++. See which is faster. The C++ compiler will always generate as-fast-or-faster code (usually faster). The Intel compiler is most often even faster... it has nothing to do with the language.

For example, I wrote a PCM audio mixer in C# and C++ and compiled it with the .NET, MS, and Intel compilers. The code in question had no GC, no boundary checks, no excuses.

C#: slowest
C++ Microsoft: fast
C++ Intel: super fast

In this example the Intel compiler recognized that the computation could be replaced by SSE2 instructions. The Microsoft compiler wasn't so smart, but it was smarter than the .NET compiler/JITer.

So I keep hearing talk about adding extensions to the language to help the GC do things more efficiently, but it seems to me the language isn't the problem. Even if those suggestions are taken, we're still hamstrung by an intentionally slow code-generating compiler/JITer. It's the compiler and the GC that should be doing a better job.

See: #4331. I'm really tired of the C++ guys saying "we don't use it because it's too slow" when there is very little reason for it to be slow.

BTW: I'm in the camp of people who don't care how long the JITer takes to do its job. Most of the world's code runs on servers... why isn't it optimized to do so?
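On the SSE2 point: `System.Numerics.Vector<T>` does give C# a way to express this kind of loop so the JIT can emit SIMD instructions where the hardware supports them. A hypothetical PCM-mixing kernel as a sketch (the `Mix` function and its averaging formula are illustrative, not SunnyWar's actual benchmark):

```csharp
using System;
using System.Numerics;

static class MixerDemo
{
    // Vector<float> operations map to SSE/AVX where available; the scalar
    // tail handles lengths that aren't a multiple of the vector width.
    public static void Mix(float[] a, float[] b, float[] result)
    {
        int i = 0;
        int width = Vector<float>.Count;
        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<float>(a, i);
            var vb = new Vector<float>(b, i);
            ((va + vb) * 0.5f).CopyTo(result, i);
        }
        for (; i < a.Length; i++)          // scalar tail
            result[i] = (a[i] + b[i]) * 0.5f;
    }

    public static void Main()
    {
        var a = new float[] { 1f, 2f, 3f, 4f, 5f };
        var b = new float[] { 3f, 2f, 1f, 0f, 1f };
        var r = new float[5];
        Mix(a, b, r);
        Console.WriteLine(r[0]); // 2
    }
}
```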


msedi commented Apr 14, 2016

I completely agree with all of the mentioned improvements. These are, in my opinion, absolutely mandatory. Using C# in high-performance applications is the right way to go, and the code would be much easier to read with at least some of the suggested improvements. Currently we have to "leave" the language for C++ or C to do things that are not possible in C#, and I don't mean assembler instructions but very simple pointer operations on blittable data types or generics.

To avoid leaving the language, I have created unreadable code fragments rather than use unmanaged code, because unmanaged code makes me dependent on x86 vs. x64.
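The pointer operations msedi describes are possible inside C# today via `unsafe` and `fixed`, though at the readability cost he mentions. A minimal sketch with a hypothetical blittable `Pixel` type (compile with `/unsafe`):

```csharp
using System;

struct Pixel   // blittable: only unmanaged fields
{
    public byte R, G, B, A;
}

static class BlitDemo
{
    // Pinning the array with `fixed` yields a raw pointer for the duration
    // of the block, allowing C-style traversal without marshalling.
    public static unsafe void Fill(Pixel[] pixels, byte value)
    {
        fixed (Pixel* p = pixels)
        {
            for (int i = 0; i < pixels.Length; i++)
            {
                p[i].R = value;
                p[i].G = value;
                p[i].B = value;
                p[i].A = 255;
            }
        }
    }

    public static void Main()
    {
        var img = new Pixel[4];
        Fill(img, 128);
        Console.WriteLine(img[3].R); // 128
    }
}
```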


ilexp commented Apr 15, 2016

BTW: I'm in the camp of people that doesn't care how long the JITer takes to do its job. Most of the world's code runs on servers...why isn't it optimized to do so?

From a gamedev perspective, it would be neat if there were a way to tell the runtime to perform extended JIT optimization via a framework API.

Let's say by default, there is only the regular, fast optimization, the application starts up quickly and all behaves as usual. Then I enter the loading screen, because I'll have to load levels and assets anyway - now would be an excellent time to tell the runtime to JIT optimize the heck out of everything, because the user is waiting anyway and expecting to do so. This could happen on a per-method, per-class or per-Assembly level. Maybe you don't need 90% of the code to be optimized that well, but that one method, class or Assembly should be.

As far as server applications go, they could very well do the same in the initialization phase. Same for audio, image and video processing software. Extended JIT optimization could be a very powerful opt-in and on runtimes that do not support this, the API commands can still just fall back to not having any effect.

Maybe it would even be possible to somehow cache the super-optimized machine code somewhere, so it doesn't need to be re-done at the next startup unless modified or copied to a different machine. Maybe partial caches would be possible, so even if not all code is super-JITed yet, at least the parts that are will be available. Which would be a lot more convenient and portable than pre-compiling an Assembly to native machine code, simply because Assemblies can run anywhere and native machine code can not.

All that said, I think both allowing the JIT to do a better job and allowing developers to write more efficient code in the first place would be equally welcome. I don't think this should be an either / or decision.
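A limited form of the loading-screen idea exists today: `RuntimeHelpers.PrepareMethod` forces methods to be JIT-compiled up front, though it only runs the standard JIT, not the hypothetical extended-optimization tier proposed above. A sketch:

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

static class PreJitDemo
{
    // Eagerly JIT-compile every concrete method declared on a type,
    // e.g. during a loading screen while the user is already waiting.
    public static void PreJit(Type type)
    {
        const BindingFlags all = BindingFlags.Public | BindingFlags.NonPublic |
                                 BindingFlags.Instance | BindingFlags.Static |
                                 BindingFlags.DeclaredOnly;
        foreach (MethodInfo method in type.GetMethods(all))
        {
            // Open generic and abstract methods have no body to compile yet.
            if (method.IsAbstract || method.ContainsGenericParameters)
                continue;
            RuntimeHelpers.PrepareMethod(method.MethodHandle);
        }
    }

    public static void Main()
    {
        PreJit(typeof(PreJitDemo)); // compile our own methods eagerly
        Console.WriteLine("pre-JIT done");
    }
}
```

This covers the "when" but not the "how well"; the extended-optimization level and the machine-code cache ilexp describes would still need runtime support.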


Member

xoofx commented Apr 15, 2016

Having advocated performance for C# for many years, I completely concur that it would be great to see more investment in this area.

Most notably along the following 3 axes:

  1. Allow switching on a better (but slower) code-gen JIT. There is high hope that this will be fulfilled by the ongoing work on LLILC, for both JIT and AOT scenarios. Note that many platforms (e.g. iOS, UWP/XboxOne, PS4) don't support JIT scenarios. But it will take time to achieve even performance parity with the current JIT, and there are some language/runtime constraints that could make full optimization difficult (GC statepoints, array/null/arithmetic safety checks... etc.)
  2. Improve the language (sometimes with proper JIT/GC support) in ways that could help in this area. That includes things listed above like ref locals, array slices, string slices... and even built-in UTF-8 strings... Some hacks can be done by post-processing IL and have been abused in many projects, but it would be great to have these little things available without any IL voodoo.
  3. Put a lot more emphasis on memory management, data locality and GC pressure:
    • Standard improvements like stack allocation for classes, embedded class instances, borrowed pointers
    • Rethinking our usage of the GC, which is a bit more problematic, as I haven't seen many proven models in production (things like explicit vs. implicit management of GC regions, to allocate known objects to a region chosen for locality/longevity)

Unfortunately, there are also some breaking-change scenarios that would require forking the language/runtime to correctly address some of the intrinsic weaknesses of the current language/runtime model (e.g. things that were done for Midori, such as their Error Model or safe native code... etc.)


Contributor

svick commented Apr 15, 2016

@SunnyWar I think there's enough room to optimize both code generation for math and the GC.

As to which one should have higher priority, keep in mind that it's relatively easy to work around bad performance in math by P/Invoking native code or using Vector<T>. Working around bad performance due to GC overhead tends to be much harder, I think.

And since you mention servers, a big part of their performance is things like "how long does it take to allocate a buffer", not "how long does it take to execute math-heavy code".
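The buffer-allocation cost svick mentions is usually attacked with pooling, which also connects back to the original post's object-pooling bullet. A deliberately minimal toy pool, not a production design:

```csharp
using System;
using System.Collections.Generic;

// Renting and returning fixed-size buffers instead of allocating fresh
// ones keeps a server hot path from generating garbage on every request.
// (Illustrative toy: not thread-safe, no size classes.)
sealed class BufferPool
{
    private readonly Stack<byte[]> _buffers = new Stack<byte[]>();
    private readonly int _size;

    public BufferPool(int size) { _size = size; }

    public byte[] Rent()
    {
        return _buffers.Count > 0 ? _buffers.Pop() : new byte[_size];
    }

    public void Return(byte[] buffer)
    {
        if (buffer.Length == _size)
            _buffers.Push(buffer);   // reuse; the GC never sees the churn
    }
}

static class PoolDemo
{
    public static void Main()
    {
        var pool = new BufferPool(4096);
        byte[] buf = pool.Rent();
        pool.Return(buf);
        bool reused = ReferenceEquals(buf, pool.Rent());
        Console.WriteLine(reused); // True
    }
}
```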


GSPP commented Apr 15, 2016

I'm adding JIT tiering to the list of features I see as required to make C# a truly high performance language. It is one of the highest impact changes that can be done at the CLR level.

JIT tiering has impact on the C# language design (counter-intuitively). A strong second tier JIT can optimize away abstractions. This can cause C# features to become truly cost free.

For example, if escape analysis and stack allocation of ref types was consistently working the C# language could take a more liberal stance on allocations.

If devirtualization worked better (right now: not at all in RyuJIT), abstractions such as Enumerable.* queries could become cost-free (identical performance to manually written loops).

I imagine records and pattern matching are features that tend to cause more allocations and more runtime type tests. These are very amenable to advanced optimizations.
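The abstraction gap GSPP describes can be made concrete with a pair of equivalent functions. Both compute the same result, but the LINQ version allocates an enumerator and a delegate and makes interface calls per element, exactly the overhead a devirtualizing second-tier JIT could in principle inline away:

```csharp
using System;
using System.Linq;

static class DevirtDemo
{
    // LINQ form: enumerator + delegate allocations, interface dispatch
    // per element under the current JIT.
    public static int SumEvensLinq(int[] data) =>
        data.Where(x => x % 2 == 0).Sum();

    // Manual loop: no allocations, direct array access.
    public static int SumEvensLoop(int[] data)
    {
        int sum = 0;
        for (int i = 0; i < data.Length; i++)
            if (data[i] % 2 == 0)
                sum += data[i];
        return sum;
    }

    public static void Main()
    {
        var data = new[] { 1, 2, 3, 4, 5, 6 };
        Console.WriteLine(SumEvensLinq(data)); // 12
        Console.WriteLine(SumEvensLoop(data)); // 12
    }
}
```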


OtherCrashOverride commented Apr 15, 2016

Born out of a recent discussion with others, I think its time to review the "unsafe" syntax. The discussion can be summarized as "Does 'unsafe' even matter anymore?" .Net is moving "out of the security business" with CoreCLR. In a game development scenario, most of the work involves pointers to blocks of data. It would help if there was less syntactic verbosity in using pointers directly.

Support for SSE4 Intrinsics: CoreFx Issue 2209

This is completely useless on the billions of ARM devices out there in the world.

With regard to the GC discussion, I do not think that further GC abuse/workarounds are the solution. Instead, there needs to be a deterministic alloc/ctor/dtor/free pattern. Typically this is done with reference counting. Today's systems are multi-core, and today's programs are multi-threaded. "Stop the world" is a very expensive operation.

In conclusion, what is actually desired is the C# language and libraries but on top of a next-generation runtime better suited for the needs of "real-time" (deterministic) development such as games. That is currently beyond the scope of CoreCLR. However, with everything finally open source, its now possible to gather a like minded group to pursue research into it as a different project.

TPoise commented Apr 15, 2016

I'm doing a lot of high-perf / low latency work in C#. One thing that would be "the killer feature" for perf work is for them to get .NET Native fully working. I know it's close, but the recent community standups have said that it won't be part of v1.0 RTM and they're rethinking the usage for it. The VS C++ compiler is amazing at auto-vectorizing, dead code elimination, constant folding, etc. It just does this better than I can hand-optimize C# in its limited ways. I believe traditional JIT compiling (not just RyuJIT) just doesn't have enough time to do all of those optimizations at run-time. I would be in favor of giving up additional compile time, portability, and reflection in exchange for better runtime performance; and I suspect those that are contributing to this thread here probably feels the same way. For those that aren't, then you still have RyuJIT.

Second, it would help if there were some tuning knobs available for the CLR itself.

GSPP commented Apr 15, 2016

Adding a proposal for Heap objects with custom allocator and explicit delete. That way latency-sensitive code can take control of allocation and deallocation while integrating nicely with an otherwise safe managed application.

It's basically a nicer and more practical new/delete.
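Something in this spirit is already possible today for unmanaged value types, albeit without any language support: `Marshal.AllocHGlobal`/`FreeHGlobal` give explicit allocation and deallocation outside the GC heap. A minimal sketch under that assumption (the `Particle` type and `NativeBuffer` wrapper are illustrative, not part of the proposal):

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct Particle { public float X, Y, Z; }

static class NativeBuffer
{
    // Allocate space for one Particle outside the GC heap; the caller owns
    // the memory and must release it deterministically with Free().
    public static IntPtr Alloc() => Marshal.AllocHGlobal(Marshal.SizeOf<Particle>());

    // Copy a managed struct into the native block.
    public static void Write(IntPtr p, Particle value) => Marshal.StructureToPtr(value, p, false);

    // Read the struct back out of the native block.
    public static Particle Read(IntPtr p) => Marshal.PtrToStructure<Particle>(p);

    // Explicit "delete": no GC involvement, fully deterministic.
    public static void Free(IntPtr p) => Marshal.FreeHGlobal(p);
}
```

The proposal linked above would make this pattern safer and nicer; today the burden of pairing every `Alloc` with a `Free` falls entirely on the caller.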

Contributor

benaadams commented Apr 15, 2016

@OtherCrashOverride @GSPP Destructible Types? #161

OtherCrashOverride commented Apr 15, 2016

Ideally, we want to get rid of IDisposable entirely and directly call the dtor (finalizer) when the object is no longer in use (garbage). Without this, the GC still has to stop all threads of execution to trace object use and the dtor is always called on a different thread of execution.

This implies we need to add reference counting and modify the compiler to increment and decrement the count as appropriate such as when a variable is copied or goes out of scope. You could then, for example, hint that you would like to allocate an object on the stack and then have it automatically 'boxed' (promoted) to a heap value if its reference count is greater than zero when it goes out of scope. This would eliminate "escape analysis" requirements.

Of course, all this is speculation at this point. But the theoretical benefits warrant research and exploration in a separate project. I suspect there is much more to gain from redesigning the runtime than there is from adding more rules and complication to the language.
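To illustrate what such compiler-inserted counting might desugar to, here is a hand-written sketch. The `RefCounted` base class is hypothetical (no such type exists in .NET); the `AddRef`/`Release` calls are what the compiler would emit when a reference is copied or goes out of scope:

```csharp
using System;
using System.Threading;

// Hypothetical base class, for illustration only: the compiler would emit
// AddRef() on reference copy and Release() on scope exit, and the "dtor"
// runs on the current thread the instant the count reaches zero.
abstract class RefCounted
{
    int _count = 1; // the creating reference

    public void AddRef() => Interlocked.Increment(ref _count);

    public void Release()
    {
        if (Interlocked.Decrement(ref _count) == 0)
            Destruct(); // deterministic, same-thread cleanup
    }

    protected abstract void Destruct();
}

class Texture : RefCounted
{
    public bool Destroyed;
    protected override void Destruct() => Destroyed = true;
}
```

Manually written out, the compiler-managed lifetime would look like: create (count = 1), copy (AddRef, count = 2), copy leaves scope (Release, count = 1), original leaves scope (Release, count = 0, destructor runs immediately).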

SunnyWar commented Apr 15, 2016

@OtherCrashOverride I've also come to the conclusion that a reference counting solution is critical for solving a number of problems.

For example, some years ago I wrote a message-passing service using an Actor model. The problem I ran into right away is that I was allocating millions of small objects (for messages coming in), and the GC pressure to clean up after they went out of scope was horrid. I ended up wrapping them in a reference-counting object to essentially cache them. It solved the problem, BUT I was back to the old, ugly COM days of having to ensure every Actor behaved and did an AddRef/Release for every message it processed. It worked, but it was ugly, and I still dream of a day I can have a CLR-managed reference-countable object with an overloadable OnRelease, so that I can put it back in the queue when the count == 0 rather than let it be GC'd.
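A hedged sketch of the wrapper pattern described above (all names are hypothetical, not from the original service): messages are rented from a pool, and the release path returns them to the pool when the count reaches zero instead of leaving them for the GC:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Illustrative only: a pooled, reference-counted message. Rent() reuses a
// pooled instance when one exists; the last Release() plays the role of the
// overloadable OnRelease and returns the object to the pool.
class PooledMessage
{
    static readonly ConcurrentBag<PooledMessage> Pool = new ConcurrentBag<PooledMessage>();

    int _refCount;
    public string Payload;

    public static PooledMessage Rent(string payload)
    {
        if (!Pool.TryTake(out var msg)) msg = new PooledMessage();
        msg._refCount = 1;
        msg.Payload = payload;
        return msg;
    }

    // Each actor that holds the message must call AddRef/Release in pairs.
    public void AddRef() => Interlocked.Increment(ref _refCount);

    public void Release()
    {
        if (Interlocked.Decrement(ref _refCount) == 0)
        {
            Payload = null; // reset state, then recycle instead of GC
            Pool.Add(this);
        }
    }

    public static int PooledCount => Pool.Count;
}
```

The ugliness complained about above is visible here: correctness depends on every consumer balancing its AddRef/Release calls by hand.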

ilexp commented Apr 15, 2016

Don't want to detail the rest of it in this general overview thread, just regarding this specific point of @OtherCrashOverride's posting:

[...] than there is from adding more rules and complication to the language.

As a general direction of design with regard to future "efficient code" additions, I think it would be a good thing to keep most or even all of them - both language features and specialized API - hidden away just enough so nobody can stumble upon them accidentally, following the overall "pit of success" rule if you will.

I would very much like to avoid a situation where improving 0.1% of performance critical code would lead to an overall increase in complexity and confusion for the 99.9% of regular code. Removing the safety belt in C# needs to be a conscious and (ideally) local decision, so as long as you don't screw up in that specific code area, it should be transparent to all the other code in your project, or other projects using your library.

Contributor

svick commented Apr 15, 2016

@OtherCrashOverride

You could then, for example, hint that you would like to allocate an object on the stack and then have it automatically 'boxed' (promoted) to a heap value if its reference count is greater than zero when it goes out of scope. This would eliminate "escape analysis" requirements.

That would require you to find and update all existing references to that object. While the GC already does that when compacting, I doubt doing it potentially at every method return would be efficient.

SunnyWar commented Apr 15, 2016

Today, the system imposes limitations on us that are purely historical and in no way limit how things can be done in the future.

OtherCrashOverride commented Apr 15, 2016

That would require you to find and update all existing references to that object.

A reference should be a handle and therefore abstract whether the storage is stack or heap. The runtime references the handle, not the actual pointer, requiring only the handle itself be updated when promoting a stack pointer to a heap pointer. Again, this is all theoretical.

Member

xoofx commented Apr 15, 2016

@OtherCrashOverride It is misleading to think that switching from a GC to a ref-counting scheme will simply solve the problem of memory management for good. The latest research on the matter is even blurring the lines between the two techniques...

Afaik, the most recent research on the subject, RC Immix conservative (2015), which is a continuation of RC Immix (2013), itself a derivation of GC Immix (2008), shows that the best RC collector is able to outperform the best GC only slightly (RC Immix conservative vs GenImmix). Note also that you need a kind of reference cycle collector for RC Immix to be fully working (detect cycles in object graphs and collect them). You will see in these documents that RC Immix is built upon these 3 key points:

  1. The original Immix memory organization (which is strongly related to locality and friendly to CPU cache lines)
  2. Reference counting is not performed on each object but on an Immix heap line
  3. A compacting memory scheme

That said, I would personally be in favor of switching to an RC Immix scheme, mostly because it delivers better predictability in "collecting" objects (note that even with RC Immix, collection of objects is not immediate, in order to achieve better performance).

That being said, again, there is a strong need for other memory management models to fill the gap in terms of performance, because the GC/RC model by itself is not enough (alloc of a class on the stack, alloc of an embedded instance in a class, borrowed/owned pointers (single-owner references which are destructible once there is no more owner)).

OtherCrashOverride commented Apr 15, 2016

It is misleading to think that switching from a GC to a ref counting scheme will simply solve the problem of memory management for good.

The goal is not to solve the problem of memory management; rather, to make it deterministic. The focus is on when object cleanup takes place rather than how long it takes. Currently, games and other real-time systems suffer due to the GC halting everything for an indeterminate amount of time while collection takes place. Additionally, running finalizers on different threads causes issues for non-thread-safe GUI APIs and 3D APIs like DirectX and OpenGL. Controlling when an object is collected allows the developer to amortize this cost as needed for the program.

[edit]
Here is an example of a real-world problem deterministic memory management would solve:
http://geekswithblogs.net/akraus1/archive/2016/04/14/174476.aspx

Member

xoofx commented Apr 16, 2016

The goal is not to solve the problem of memory management; rather, to make it deterministic.

Sure, I know what RC is good for, and that's what I implied by 😉

mostly because it delivers a better predictability in "collecting" objects

But determinism can also be achieved by alternative forms. The following options are perfectly deterministic, and in their use cases they deliver even better performance than GC/RC scenarios because they are lighter and localized exactly where they are used:

  • Allocation of a class on the stack
  • Embedded instance
  • Borrowed pointer (also known as a single-owner reference)

Note that for an RC scheme to be efficient enough, you still need to batch the collection of objects (see what RC Immix is doing). Also, you usually don't expect to have to run a finalizer on every object.

If you have followed the development of a language like Rust, they started with RC objects alongside borrowed/owner references, but at some point they almost completely ditched RC and are able to live largely without it. One idea lies in the fact that most objects used in a program are not expected to be concurrently shared between threads (or even stored in other objects, for stack-alloc cases, etc.), and this simple observation can lead to some interesting language designs/patterns/optimizations that RC/GC scenarios are not able to leverage.

The bottom line is: allocation on the GC/RC heap should be the exception, not the rule. That's where deterministic allocation/deallocation really starts to shine in a modern/efficient language. But that's a paradigm that is not easy to introduce afterwards without major breaking changes.
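One of those deterministic forms already exists in C# for value-type buffers: `stackalloc`. (The `Span<T>`-based form shown here arrived in C# 7.2, after this thread was written; before that, the same allocation required an `unsafe` context and a raw pointer.)

```csharp
using System;

static class StackSum
{
    // Deterministic stack allocation for value types: the buffer lives
    // exactly as long as the method frame and involves neither the GC nor
    // reference counting.
    public static int SumOfSquares(int n)
    {
        Span<int> buffer = stackalloc int[n];
        for (int i = 0; i < n; i++) buffer[i] = i * i;

        int sum = 0;
        foreach (int v in buffer) sum += v;
        return sum; // buffer is reclaimed automatically when the frame pops
    }
}
```

The gap the comment describes is that nothing comparable exists for reference types: classes always go to the GC heap.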

OtherCrashOverride commented Apr 16, 2016

Before any more academic papers are cited, we should probably get back to the topic of discussion: "State / Direction of C# as a High-Performance Language"

My suggestion was effectively "Make unsafe code easier to use and more of a first-class citizen." I will add that this includes "inline assembly" of .NET bytecode.

Other than that, I do not feel that burdening C# with more special rules and special cases is the optimal solution (re: destructible types). Instead, modifying the runtime itself could transparently extend performance gains to all .NET languages. As an example of this, a reference counting strategy was cited. It is outside the scope of this thread to define that strategy here.

Also mentioned was that it is now possible to actually implement and test these ideas, since everything is now open source. That is the point in time for discussion about implementation strategies and the citing of academic papers. We can transform theory into reality and have a concrete representation of what works, what doesn't, what can be done better, and what it will break. This would be far more useful than just debating theory in GitHub comments.

Furthermore, I am explicitly against adding SSE4 intrinsics to the C# language. C# should not be biased toward any specific processor instruction set.

[edit]
To clarify: one of the points I am attempting to illustrate is that as we continue to expand across more cores and more threads using those cores, the cost of stopping all of them to do live object tracing becomes increasingly expensive. This is where the reference counting discussion comes into play. It's a suggestion, not a proposal.

Member

xoofx commented Apr 16, 2016

This would be far more useful than just debating theory in GitHub comments.

Precisely, I have contributed to the prototype for class allocation on the stack linked above, and I strongly encourage .NET performance enthusiasts to build such prototypes.

Contributor

svick commented Apr 16, 2016

@OtherCrashOverride

A reference should be a handle and therefore abstract whether the storage is stack or heap. The runtime references the handle, not the actual pointer, requiring only the handle itself be updated when promoting a stack pointer to a heap pointer. Again, this is all theoretical.

So, to avoid allocating and deallocating an object on the heap, you instead have to allocate and deallocate a handle on the heap? That doesn't sound like a great win, especially since it also makes every dereference about twice as slow as previously.
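For illustration only, a toy handle table makes the trade-off in this exchange concrete: relocating an object patches a single slot, so existing handles stay valid, but every dereference pays an extra hop (none of these types exist in the runtime):

```csharp
using System;
using System.Collections.Generic;

// Toy sketch of the handle scheme under discussion: all access goes through
// a handle table, so "promoting" an object (e.g. stack -> heap) only
// updates one slot, at the price of a double indirection per access.
class HandleTable<T> where T : class
{
    readonly List<T> _slots = new List<T>();

    public int Allocate(T obj) { _slots.Add(obj); return _slots.Count - 1; }

    // The extra hop every dereference now pays.
    public T Dereference(int handle) => _slots[handle];

    // Relocation: patch the single slot; all existing handles stay valid.
    public void Relocate(int handle, T movedCopy) => _slots[handle] = movedCopy;
}
```

This is essentially the classic handle-table design; the disagreement above is about whether the per-access cost is worth the cheap relocation.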

jonathanmarston commented Apr 16, 2016

@OtherCrashOverride

I am explicitly against adding SSE4 intrinsics to the C# language. C# should not be biased toward any specific processor instruction set.

I agree that intrinsics shouldn't be created for all SSE4 instructions just for the sake of allowing access to SSE4 from C#, but there are some very common bit-level operations for which it would make sense to add highly optimized implementations to the framework anyway. For these, why not make them JIT to a single instruction, if available on the executing platform?

POPCNT especially comes to mind here. Having a BitCount method for numeric types built in to the framework makes sense even without intrinsics (just look how many times MS has reimplemented this algorithm across their own codebases alone). Once it's added at the framework level, why not put the little bit of extra effort in to optimize it away during the JIT when the instructions are available to do so?
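As a sketch of what such a framework-level `BitCount` fallback could look like, here is the classic branch-free population count (the well-known Hacker's Delight algorithm, one of the implementations MS has repeated across codebases); a JIT could substitute a single POPCNT instruction where the hardware supports it:

```csharp
static class Bits
{
    // Classic branch-free population count: counts set bits in pairs, then
    // nibbles, then sums all bytes via the multiply trick. This is the kind
    // of software fallback a framework BitCount could ship, leaving the JIT
    // free to emit POPCNT on CPUs that have it.
    public static int BitCount(uint v)
    {
        v = v - ((v >> 1) & 0x55555555u);
        v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
        v = (v + (v >> 4)) & 0x0F0F0F0Fu;
        return (int)((v * 0x01010101u) >> 24);
    }
}
```

(Later .NET versions did exactly this with `System.Numerics.BitOperations.PopCount`, which is hardware-accelerated where possible.)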

OtherCrashOverride commented Apr 16, 2016

@svick

So, to avoid allocating and deallocating an object on the heap, you instead have to allocate and deallocate a handle on the heap?

This discussion is not intended to define a specification for memory management. It is hoped that an implementer of such a system would be competent and intelligent enough to avoid obvious and naive design issues.

@jonathanmarston
The key point is that adding something like a BitCount method is a framework and/or runtime modification. It should not require any modification to the C# language to support. I am certainly in favor of adding first-class SIMD support to the framework/runtime; however, I do not believe it has a place as part of the C# language itself, as SSE4 (or other architecture) intrinsics would.

msedi commented Apr 16, 2016

I also agree that there should be no SSE or other intrinsics in the IL. Moreover, there should be a more generic abstraction that lets the JITter recognize that there might be, for example, a vector operation. Currently, writing a simple addition of two arrays produces too much IL code.

But I must admit I'm currently not aware of what the JITter really makes out of this code.

Another thing is that it really might be helpful to go back to some inline IL, something like the asm keyword in C/C++. Even Serge Lidin and some other guys wrote pre- and post-processors for C# to feed in IL code. But in fact I don't like this back-and-forth assembling and disassembling just to get native .NET things into my code.

HaloFour commented Aug 31, 2016

@martinvahi

Rather than trying to take a general purpose programming language like C# and trying to shoe-horn it into a very constrained and specific space such as highly automatically concurrent programming, why don't you actually use one of the various languages designed for that job? It's not like any of those languages let you write whatever you want and it's magically concurrent/faster. What those languages do is force you onto the rails where the program could be parallelized. But if the tasks aren't sufficiently independent you won't realize any benefits regardless of the language or compiler. The act of being concurrent adds a lot of overhead, regardless of the language, OS, hardware, etc.

Try F# with Akka if you want something CLR-based.

@SunnyWar

SunnyWar commented Aug 31, 2016

We have had "solvers" for years, such as the Microsoft Solver Foundation, that can find optimal or near-optimal solutions to a large variety of problems. I hope something like this already exists in the compiler; if not, perhaps it should. I bet there are a large number of decisions the compiler makes that can be expressed as a Model with Parameters, Decisions, Goals[{Minimize|Maximize}], and Constraints. I also bet the solutions a solver comes up with will be as good as or better than anything a human has hardcoded.

@martinvahi

martinvahi commented Sep 1, 2016

@HaloFour

Rather than trying to take a general purpose programming language like C# and trying to shoe-horn it into a very constrained and specific space such as highly automatically concurrent programming, why don't you actually use one of the various languages designed for that job?

Thank You and the @SunnyWar and others for the answers.

My last proposal was that instead of doing the 2 parts, a speed-optimized part and a development-comfort-optimized part, in 2 different programming languages, the 2 parts might both be written in C#, except that the development-comfort-optimized part can use all of C# and the speed-optimized part is allowed to use only a subset of C#. The benefit of such a solution over the Ruby-gems-written-in-C approach is that the development-comfort-optimized part will become even more comfortable due to the possibility of avoiding all the glue that it takes to combine components that have been written in different programming languages. The speed-optimized part will probably be error-prone and laborious to write no matter what programming language and automation is used. As a matter of fact, the solution is even modular in the sense that different speed optimizations might allow the use of different subsets of the C# language. Developers of speed-optimized parts might have to choose which type of auto-optimizer they want to use and then write their speed-optimized part by using only that subset of C# that is supported by the optimizer that they want to use.

This also allows experimentation at the academic side and gradual migration of the academic results to practice.

@HaloFour

HaloFour commented Sep 1, 2016

@martinvahi

The fact that you conflate "speed-optimized" with parallelism highlights the fact that you don't understand the problem space. They're not the same thing and they're not written in the same manner. Such a "dialect" of "C#" would be so fundamentally different from C# that it simply wouldn't be C#. For reference, I suggest you look into Microsoft Research language Axum.

What you suggest simply doesn't make sense. The languages and tools to write parallel programs already exist, including on the CLR. Microsoft has no reason to fork off a dialect of C# just to hack it up into something unrecognizable from C#. You may feel free to give it a try, though.

@antiufo

antiufo commented Sep 15, 2016

Shameless plug: in case someone is interested, here's a tool I recently published, roslyn-linq-rewrite, that rewrites the syntax trees of LINQ expressions and turns them into allocation-free, procedural code, by setting a custom compilerName in project.json.

@temporaryfile

temporaryfile commented Sep 15, 2016

It's gotten clearer and clearer to me that most of C#'s performance problems are really framework performance problems. And it is far easier for me to rewrite the remaining parts of the .NET Framework for my competitive advantage than to complain to website people that Stream async methods need buffer pointer overloads.

It's strange that many of the same ones who defend the most awkward points of C#'s "striving for correctness" usually see nothing wrong with using signed integers for 0-based absolute pointers. It feels like being on an MMO forum watching the debate between crafters and raiders.

@chrisaut

chrisaut commented Sep 16, 2016

@antiufo this looks very very nice (disclaimer, I haven't actually tried it).

But I fear it will reach a very limited audience being maintained as a separate tool. Do you have any plans to propose/integrate all or parts of this into the "mainstream" compiler here?

Anything you learned from it that suggests it cannot/should not be done in the general case?

@antiufo

antiufo commented Sep 16, 2016

@chrisaut based on previous discussions, there seem to be three options:

Escape analysis only (JIT)

  • Func<> and <>_DisplayClass allocations could be made on the stack
  • The price of virtual calls (GetEnumerator, Func<>.Invoke()) would still be paid, no inlining
  • Type checks for ICollection and other special cases would still be paid

LINQ calls are optimized by the JIT

An [Intrinsic] attribute could be applied to Enumerable.Where<>, Select<> and so on. When the jitter sees calls to these methods, it could produce optimized procedural code based on its knowledge of how Where, Select methods should behave. By adding [Intrinsic] to your Where-like method, you are guaranteeing that this is indeed the behavior you want (#2385).

  • The IL code is still going to contain things like new <>DisplayClass. The jitter would have to collect all of these details and determine how to output the correct procedural machine code (it cannot rely on the syntax trees that roslyn is instead aware of). This means that the implementation could be quite ugly/complicated.

LINQ calls are optimized by the compiler

When the roslyn compiler sees a chain of calls to methods marked with [Intrinsic] and well-known names, it could produce optimized procedural code on its own.

  • Old assemblies would not benefit from this, they have to be recompiled.
  • Changing the Where/Select implementation would require recompiling the assemblies.

While I see that this last point might seem a bit problematic, this is not the first time we sacrifice the freedom to retroactively change the implementation of some very-straightforward code for the sake of performance (like [NonVersionable] properties/methods).
While the current implementation of many LINQ methods is non-trivial and could change in the future, most of the complexity is due to optimizations, special cases, type checks, management of IEnumerators.

After all, it's not unreasonable to think that the procedural implementation of a Where+Count over a non-null array is always going to be

var count = 0;
for (var i = 0; i < arr.Length; i++) // or foreach
{
    if(condition...arr[i]...) count++;
}

The code produced by the compiler might change slightly across compiler versions, but this already happens with other language features.

Of course, there's no need to optimize all LINQ methods. Complex ones like GroupBy could still be compiled in the traditional way (closure+func+call).
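For comparison, the LINQ chain that such a procedural loop would replace looks something like the following (illustrative only; `condition` stands in for an arbitrary predicate, as in the loop sketch above):

```csharp
// Allocation-heavy form: a closure/delegate and an iterator object are
// allocated, and the predicate runs through a delegate call per element.
var count = arr.Where(x => condition(x)).Count();
```

Under option 3, the compiler would lower this chain into the counting loop directly, eliminating the delegate, the enumerator, and the per-element indirect calls.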

@torgabor

torgabor commented Sep 21, 2016

@antiufo Just my 2 cents regarding the 3 possible solutions: I think the 3rd solution is the best, for the following reasons:

  • As you stated, solution 1 doesn't solve the problem completely; however, with even more compiler machinery in the JIT (stack allocation, devirtualization, inlining), the constructs could be reduced to a soup of plain instructions and the cost could be eliminated. However, this requires a lot of development effort, and it's not clear to me that this is feasible.
  • Solution 2 also seems far from trivial, and also a bit inelegant, due to how lambdas are implemented. Consider the following code (which is not LINQ but would be a candidate for a similar optimization):
            int x = 2;
            Action lambda = () => x++;
            lambda();
            Console.WriteLine(x);

The compiler moves x off the stack onto a display class, and Console.WriteLine gets passed the field of the display class as a parameter, etc. The compiler would need to recognize these patterns and transform them back into stack allocations and loops. So in this case the compiler would apply source transformations that actively work against us, which we then have to undo in the JIT.

  • Solution 3's only disadvantage, I think, is that it could change the semantics of the source (which obviously should not be allowed) if the implementer is not careful.
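For reference, the closure lowering described above produces roughly the following compiler-generated code (a sketch; the real generated class carries an unspeakable name like `<>c__DisplayClass0_0`):

```csharp
// Sketch of what the C# compiler emits for the captured-variable lambda above.
sealed class DisplayClass          // heap-allocated, one instance per closure
{
    public int x;                  // 'x' is hoisted from the stack to a field
    public void Lambda() { x++; }  // the lambda body, mutating the field
}

// The original method body then becomes roughly:
var closure = new DisplayClass();  // allocation the JIT would have to elide
closure.x = 2;
Action lambda = closure.Lambda;    // a second allocation, for the delegate
lambda();                          // invoked through the delegate
Console.WriteLine(closure.x);      // reads the field, not a stack slot
```

This is why escape analysis alone is not enough: both the display class and the delegate must be proven non-escaping, and the `Invoke` devirtualized, before the loop-like shape of the original code becomes visible again.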
@jcdickinson

jcdickinson commented Sep 21, 2016

Concerning @antiufo's scenario 1 (I really don't like [Intrinsic]), I had an attempt at trying to get the JITter to do the "right" thing with Where(), by implementing the whole lot using structs. The disassembly (4.6.1, x64, release) shows multiple call ops, indicating that the desired inlining is not occurring. Using an array instead of IEnumerable seems to improve things, but gets nowhere near the desired results (and still has multiple call ops in the foreach block).

@xen2

xen2 commented Sep 21, 2016

Solution 3, if implemented, would need a lot of tedious manual work, and would work only on LINQ. It would feel strange that certain method optimizations are hardcoded in Roslyn while a user wouldn't be able to do the same with his own code.
Of course, making it extensible is possible (compiler plugins), but that would add huge complexity, with lots of chances for mistakes in each LINQ method optimizer, etc., so I would rather avoid this approach. Or maybe it should be quite generic, like an "IL" optimizer/pass plugin system that could do any kind of optimization, like LLVM? (I would be happy with that, as we already have to do IL post-processing to various extents in our game engine.)

Solution 1 seems much better to have in general (a lot of code could benefit from it, and it should be "easier" to prove and make safe in general), and it seems some effort has already started in that direction, cf. https://twitter.com/xjoeduffyx/status/771023674694524932

@mattwarren

mattwarren commented Sep 21, 2016

Joe Duffy has a great talk that covers (amongst other things) what would need to be done to optimise LINQ, escape analysis, stack allocation etc (slides are available if you don't want to watch the whole thing)

@msedi

msedi commented Sep 22, 2016

I have a general question about for loops. I was trying to optimize my mathematical routines, which mostly operate on arrays of different types. As discussed very often here, the problem is that I cannot have pointers to generics, so I had to duplicate my functions for all primitive types. However, I have accepted - or rather resigned myself on - this topic, since it seems it will never come.

Nevertheless, I have also tried the same - also discussed here - with IL code, which works fine for my solution, but there too it would be nice to have some inline IL assembler, like the old asm {} keyword in C++; here as well I guess it will never come.

What currently bothers me is how a for loop is "converted" into IL code. From my old assembler knowledge there was the LOOP instruction, which used CX as the count register. In IL it seems that all loops are converted to IF...GOTO statements, which I feel very uncomfortable with, since I think no jitter will ever recognize that an IF...GOTO statement can be converted to the LOOP construct of the x86 architecture. I guess that doing loops with IF...GOTO costs much more than the x86 LOOP. What does the jitter do to optimize loops?

Am I right or wrong on this topic?

@bbarry

bbarry commented Sep 22, 2016

@msedi by building all loops in IL roughly the same way the jitter can search for a common pattern to optimize. Indeed the core CLR (and I assume desktop as well) does identify a number of such possible loops. For example:

https://github.com/dotnet/coreclr/blob/393b0a8262e5e4f1fed27494af3aac8778616d4c/src/jit/optimizer.cpp#L1195

Try to find loops that have an iterator (i.e. for-like loops) "for (init; test; incr){ ... }"
We have the following restrictions:

  1. The loop condition must be a simple one i.e. only one JTRUE node
  2. There must be a loop iterator (a local var) that is incremented (decremented or lsh, rsh, mul) with a constant value
  3. The iterator is incremented exactly once
  4. The loop condition must use the iterator.
@svick

Contributor

svick commented Sep 22, 2016

@msedi Apparently, LOOP has been slower than jump since 80486.

And finding loops is easy for the JIT, you just have to find a cycle in the control flow graph generated from the IL.

@msedi

msedi commented Sep 22, 2016

@bbarry && @svick : Thanks for the explanations. That helps.

@andre-ss6

andre-ss6 commented Sep 22, 2016

Wonderful talk by Joe Duffy. I felt happy to hear that they're [apparently] tackling all those problems we're discussing here.

And geez, I was impressed, to say the least, to hear that some applications from Microsoft (!) spend 60% of their time in GC. 60%!! My god.

@rstarkov

rstarkov commented Sep 22, 2016

@andre-ss6 hits the nail on the head. Of course not all performance issues are due to allocations. But unlike most performance issues, which have sane solutions in C#, if you run into 99% time spent in GC then you're pretty much stuffed.

What are your options at this stage? In C# as it stands today, pretty much the only option is to use arrays of structs. But any time you need to refer to one of those structs, you either go unsafe and use pointers, or you write extremely unreadable code. Both options are bad. If C# had AST macros, the code to access such "references" could be vastly more readable without any performance penalty added by the abstraction.

One of the bigger improvements on code that's already well-optimized comes from abandoning all the nice and convenient features like List<T>, LINQ or the foreach loop. The fact that these are expensive in tight code is unfortunate, but what is worse is that there is no way to rewrite these in a way that's comparable in readability - and that's another thing AST macros could help with.

Obviously the AST macros feature would need to be designed very carefully and would require a major time investment. But if I had a vote on the subject of the one single thing that would make fast C# less of a pain, AST macros would get my vote.

P.S. I was replying to Andre's comment from almost a month ago. What are the chances he'd comment again minutes before me?!

@agocke

Contributor

agocke commented Sep 22, 2016

@rstarkov Hmm, I would object to calling a codebase that's using LINQ "well-optimized." That's basically saying, "I'm not allocating anything, except for all these allocations!" :)

@SunnyWar

SunnyWar commented Sep 23, 2016

I'm happy to see ValueTask. I hope it makes it into the Dataflow blocks. I wrote an audio router a few years ago. After profiling, I found it spent most of its time in the GC cleaning up Tasks... and there was nothing I could do about it without completely throwing out the Dataflow pipeline (basically the guts of the whole thing).

@benaadams

Contributor

benaadams commented Sep 23, 2016

What are your options at this stage? In C# as it stands today, pretty much the only option is to use arrays of structs. But any time you need to refer to one of those structs, you either go unsafe and use pointers, or you write extremely unreadable code. Both options are bad.

@rstarkov you can use ref returns and locals in C# 7 with Visual Studio “15” Preview 4, though alas you can't use it with .NET Core currently. However, it is coming and should address this particular issue.
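[Editor's note] A minimal sketch of what ref returns and ref locals buy in the array-of-structs scenario described above; the Particle type and FindAt helper are invented for illustration:

```csharp
// Sketch: C# 7 ref returns/locals let you alias a struct inside an array
// without copying it and without unsafe pointers.
struct Particle { public float X, Y; }

static class ParticleOps
{
    // Returns a reference to the array element itself, not a copy of it.
    public static ref Particle FindAt(Particle[] particles, int index)
    {
        return ref particles[index];
    }
}

class Demo
{
    static void Main()
    {
        var particles = new Particle[16];

        // The ref local aliases the array slot; writes go straight to the array.
        ref Particle p = ref ParticleOps.FindAt(particles, 3);
        p.X = 1.5f;

        System.Console.WriteLine(particles[3].X);  // 1.5
    }
}
```

Previously the only non-unsafe alternatives were copying the struct out and back, or passing indices around everywhere.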

@svick

Contributor

svick commented Sep 23, 2016

@SunnyWar ValueTask makes the most sense when a method sometimes completes synchronously and sometimes asynchronously (in which case you don't have to allocate in the synchronous case). Not sure if using it would solve some general issue in Dataflow.

Were your transforms synchronous or asynchronous? If they were asynchronous, then you probably can't avoid allocating Tasks. If they were synchronous, then I'm not sure why Dataflow would allocate lots of Tasks.
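[Editor's note] A sketch of the sometimes-synchronous pattern described above, where ValueTask&lt;T&gt; skips the Task allocation on the hot path; CachingReader and its members are invented for illustration, with Task.Delay standing in for real I/O:

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Sketch: ValueTask<T> avoids a Task allocation on the (common) synchronous path.
class CachingReader
{
    private readonly ConcurrentDictionary<string, string> _cache =
        new ConcurrentDictionary<string, string>();

    public ValueTask<string> ReadAsync(string key)
    {
        // Cache hit: completes synchronously, no Task allocated.
        if (_cache.TryGetValue(key, out var cached))
            return new ValueTask<string>(cached);

        // Cache miss: fall back to a genuinely async path (allocates a Task).
        return new ValueTask<string>(FetchAndCacheAsync(key));
    }

    private async Task<string> FetchAndCacheAsync(string key)
    {
        await Task.Delay(10);  // stand-in for real I/O
        return _cache[key] = key.ToUpperInvariant();
    }
}
```

When the asynchronous branch is rare, the per-call allocation cost drops to essentially zero.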

@agocke

Contributor

agocke commented Sep 23, 2016

@benaadams Technically the NuGet packages are available, but it'd probably require building the .NET CLI repo from source

@benaadams

Contributor

benaadams commented Sep 23, 2016

@agocke compiling it is one thing and important for CI; but development work doesn't flow so well when the UI tooling doesn't understand it very well and highlights errors :-/

@agocke

Contributor

agocke commented Sep 23, 2016

@benaadams Duh, I totally forgot about the IDE :-p

@bbarry

bbarry commented Sep 23, 2016

@SunnyWar, @svick also, if you aren't careful in Dataflow you can wind up with many allocations related to closures, lambdas and delegates, even if the transforms were synchronous (it seems pretty much impossible to avoid at least some in any case; sometimes it might even be reasonable to hold on to references intentionally to lighten GC pressure in particular places).
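[Editor's note] A sketch of the closure allocations mentioned above; the method names are invented for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: a lambda that captures a local forces the compiler to allocate a
// closure object (plus a delegate instance) each time the enclosing method runs.
class ClosureDemo
{
    // Capturing version: 'threshold' is lifted into a heap-allocated closure,
    // so every call allocates before any element is even processed.
    static IEnumerable<int> AboveThreshold(int[] values, int threshold)
        => values.Where(v => v > threshold);

    // Capture-free version: the lambda touches no locals, so the compiler can
    // cache a single delegate instance and reuse it on every call.
    private static readonly Func<int, bool> IsPositive = v => v > 0;

    static IEnumerable<int> Positives(int[] values)
        => values.Where(IsPositive);
}
```

The same trade-off applies to the delegates handed to Dataflow blocks: a capture-free, cached delegate is allocated once, while a capturing one is allocated per construction (or per post, if created inline).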

@jcdickinson

jcdickinson commented Sep 26, 2016

@rstarkov ArraySegment<T> (essentially a slice) and a few home-grown extension methods can help with that, although not perfect. Also, have a look at Duffy's Slices.
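[Editor's note] A minimal sketch of the ArraySegment&lt;T&gt; approach; the Sum helper is invented for illustration:

```csharp
using System;

// Sketch: ArraySegment<T> as a poor man's slice - an (array, offset, count)
// triple that lets you hand out a window into a buffer without copying it.
class SegmentDemo
{
    static int Sum(ArraySegment<int> segment)
    {
        int total = 0;
        for (int i = 0; i < segment.Count; i++)
            total += segment.Array[segment.Offset + i];
        return total;
    }

    static void Main()
    {
        int[] buffer = { 1, 2, 3, 4, 5, 6 };

        // A view over elements at indices 2..4 (values 3, 4, 5) - no copy.
        var middle = new ArraySegment<int>(buffer, 2, 3);

        Console.WriteLine(Sum(middle));  // 12
    }
}
```

The "not perfect" caveat is real: callers can ignore Offset/Count and reach into the whole underlying array, which Span&lt;T&gt;-style slices later fixed by construction.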

@timgoodman

timgoodman commented Sep 29, 2016

@agocke The fact that a codebase which uses LINQ is not "well-optimized" is exactly the problem. There's no reason in principle why, at least in the simpler cases (which are probably the majority), the compiler couldn't do stack allocation, inlining, loop fusion, and so forth to produce fast, imperative code. Broadly speaking, isn't that (a big part of) why we have a compiler - so we can write expressive, maintainable code, and let the machine rewrite it as something ugly and fast?

Don't get me wrong, I'm not expecting the compiler to completely free me from having to optimize, but optimizing some of the most common uses of Linq2Objects seems like relatively low-hanging fruit that would benefit a huge number of C# devs.
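[Editor's note] A sketch of the kind of rewrite being asked for here, done by hand; names are invented for illustration:

```csharp
using System;
using System.Linq;

// Sketch: the LINQ form allocates iterator objects for Where and Select plus
// two delegate instances; the hand-fused loop allocates nothing.
class LinqFusion
{
    // Expressive form.
    static int SumOfDoubledEvens(int[] values)
        => values.Where(v => v % 2 == 0).Select(v => v * 2).Sum();

    // Fused equivalent: one pass, zero allocations.
    static int SumOfDoubledEvensFused(int[] values)
    {
        int sum = 0;
        foreach (int v in values)
            if (v % 2 == 0)
                sum += v * 2;
        return sum;
    }

    static void Main()
    {
        int[] data = { 1, 2, 3, 4 };
        Console.WriteLine(SumOfDoubledEvens(data));       // 12
        Console.WriteLine(SumOfDoubledEvensFused(data));  // 12
    }
}
```

The request in this thread is for the compiler (or JIT) to perform this loop fusion automatically for common Linq2Objects shapes, so the first form costs the same as the second.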

@timgoodman

timgoodman commented Sep 29, 2016

@mattwarren That Joe Duffy talk is amazing, thanks for sharing! To what degree is this work already in progress with the C# compiler, as opposed to just in experimental projects like Midori? In particular, the stuff he's talking about at around 23:00 seems a lot like what people here are asking for as far as LINQ optimizations. Is there an issue in this GitHub repo that tracks the progress on that?

@benaadams

Contributor

benaadams commented Sep 29, 2016

@timgoodman there are things here and there dotnet/coreclr#6653

@timgoodman

timgoodman commented Sep 29, 2016

@benaadams Thanks. I guess I'm not sure why this sort of thing would be under coreclr. The kinds of changes that Joe Duffy was describing seem like compiler optimizations - shouldn't they belong in roslyn or maybe llilc?

@timgoodman

timgoodman commented Sep 29, 2016

Ah, never mind, I hadn't realized that the coreclr repo contains the JIT compiler. I guess that's where this sort of optimization would need to happen for it to apply to calls to System.Linq methods.

@ciplogic

ciplogic commented Jan 2, 2017

Great effort!

Sounds a bit silly to point out, but I've noticed only one register allocator in the codebase: LSRA (linear scan).

Is it possible to use a different register allocator, at least for methods flagged with AggressiveInlining? Maybe LLVM's new backtracking (greedy) allocator, or a full (graph-coloring) register allocator?

@ciplogic

ciplogic commented Jan 2, 2017

It would be great to have at least minimal CHA (class hierarchy analysis): devirtualize calls on sealed classes, or treat internal classes in an assembly that are never overridden as sealed, and use that information to devirtualize methods (more aggressively).

Very often calls like ToString cannot be safely devirtualized because of the possibility that the method is overridden. But private/internal classes in an assembly are easier to track for overrides, especially as the types and their relations are local to the assembly.

This analysis would slightly increase startup time, but it could be enabled in a "performance mode" tier.
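[Editor's note] A sketch of the conditions described above; the Shape/Circle types are invented for illustration:

```csharp
// Sketch: a virtual call through a variable whose concrete type the JIT can
// prove is sealed (or never overridden in the assembly) can become a direct
// call, which in turn makes it a candidate for inlining.
abstract class Shape
{
    public abstract double Area();
}

// 'sealed' guarantees no further overrides exist.
sealed class Circle : Shape
{
    public double Radius;
    public override double Area() => System.Math.PI * Radius * Radius;
}

class Demo
{
    static void Main()
    {
        var c = new Circle { Radius = 1.0 };

        // Static type is the sealed Circle: the JIT may call Area() directly.
        double direct = c.Area();

        // Static type is Shape: without class hierarchy analysis the JIT must
        // keep the virtual dispatch, even though only Circle exists here.
        Shape s = c;
        double virt = s.Area();

        System.Console.WriteLine(direct == virt);  // True
    }
}
```

The request is for the JIT to prove the second case is equivalent to the first whenever the class hierarchy visible to it permits.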

@damageboy

damageboy commented Mar 5, 2017

Hi all,
I've recently taken some time to "Make the C# Benchmarks Great Again" by creating a new C# version of k-nucleotide that beats the old version by a factor of 2 and should probably land in position #3 on that specific test.

I mostly did this because I couldn't stand seeing Java score better than C#, but my mental issues are not the subject of this issue.

The main improvement in this version (i.e. where most of the fat came off) is the use of a ref-return dictionary instead of .NET's Dictionary<TKey, TValue>.

Try as I might, I couldn't find a proper discussion of adding new data structures, or ref-return APIs on the existing ones, to System.Collections.Generic, and I find that somewhat bizarre...

Is anyone here aware of a discussion / decision regarding this?

It feels too weird for the Roslyn team to ship this very nice new language feature and leave the BCL data structures out of it, so I feel I should ask whether anyone here, whom I assume to be very knowledgeable about the high-perf situation of .NET / C#, could elaborate on where we currently stand...?
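[Editor's note] A toy sketch of the kind of ref-return dictionary API being discussed; RefDictionary and GetOrAddValueRef are invented names, and the fixed-size linear-probing table is deliberately simplistic (no resizing, so returned refs stay valid):

```csharp
using System;

// Sketch: a lookup that returns a 'ref' into the entry array, so callers can
// update a value in place with a single hash/probe instead of two.
class RefDictionary<TKey, TValue> where TKey : IEquatable<TKey>
{
    private struct Entry { public bool Used; public TKey Key; public TValue Value; }
    private readonly Entry[] _entries = new Entry[64];  // toy: fixed capacity

    public ref TValue GetOrAddValueRef(TKey key)
    {
        int i = (key.GetHashCode() & 0x7FFFFFFF) % _entries.Length;
        while (_entries[i].Used && !_entries[i].Key.Equals(key))
            i = (i + 1) % _entries.Length;  // toy: assumes the table never fills

        if (!_entries[i].Used) { _entries[i].Used = true; _entries[i].Key = key; }
        return ref _entries[i].Value;       // a live reference into the table
    }
}

class Demo
{
    static void Main()
    {
        var counts = new RefDictionary<string, int>();

        // One probe per token, versus TryGetValue + indexer (two probes) with
        // the BCL Dictionary<TKey, TValue>.
        foreach (var token in new[] { "a", "b", "a" })
            counts.GetOrAddValueRef(token)++;

        Console.WriteLine(counts.GetOrAddValueRef("a"));  // 2
    }
}
```

This is the shape of API the comment is asking the BCL to grow on its real, resizing collections.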

@jcouv

Member

jcouv commented Oct 22, 2017

@ilexp C# 7.2 adds a number of performance-related features: ref readonly, readonly structs, and ref structs (such as Span<T>). We're also considering "ref local reassignment" for a subsequent release.

I haven't seen much activity on this thread for a while. Consider closing if resolved.
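[Editor's note] A brief sketch of the C# 7.2 features listed above; the Vector2 type is invented for illustration:

```csharp
using System;

// Sketch: a readonly struct passed by 'in' reference (no copy, and no
// defensive copies on member access), plus Span<T> slicing a stack buffer.
readonly struct Vector2
{
    public readonly float X, Y;
    public Vector2(float x, float y) { X = x; Y = y; }
}

class Demo
{
    // 'in' passes a read-only reference; because Vector2 is a readonly struct,
    // the compiler skips defensive copies when its members are accessed.
    static float Dot(in Vector2 a, in Vector2 b) => a.X * b.X + a.Y * b.Y;

    static void Main()
    {
        var a = new Vector2(1, 2);
        var b = new Vector2(3, 4);
        Console.WriteLine(Dot(in a, in b));  // 11

        // Span<T> (a ref struct) slices a stackalloc buffer with no heap
        // allocation and no unsafe code - this form is new in C# 7.2.
        Span<int> stackBuffer = stackalloc int[4];
        Span<int> tail = stackBuffer.Slice(2);
        tail[0] = 42;
        Console.WriteLine(stackBuffer[2]);   // 42
    }
}
```

Together these address several of the array-of-structs and slicing pain points raised earlier in the thread.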

@ilexp

ilexp commented Oct 22, 2017

@jcouv Yep, have been excitedly watching the new developments in C# and they definitely address some of the points. Others still remain to be discussed or addressed, but the big unsafe / slice / span part is done and discussion has been diverted to the individual issues in CoreCLR and CSharpLang. Closing this, looking forward to future improvements.
