Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft PR for CoreCLR vectorized sorting #33152

Closed
wants to merge 1 commit into from

Conversation

damageboy
Copy link
Contributor

@damageboy damageboy commented Mar 4, 2020

Hi everyone.

This is the culmination of lots of lonely nights of research and what has become (also) a separate nuget packge for sorting primitives with Avx2 intrinsitcs.

I'm submitting this for high-level discussion about the approach and if this sort of code even fits inside something like the dotnet runtime, rather than discussion on the code style and the little duplicated code that still exists between this PR and the current Span based Array.Sort implementation.

The current PR only deals with the Int32 case, although there is no reason why other, very similar versions would not handle all the rest of the native types supported by .NET/C#.

The entire concept and optimization journey is depicted in quite a lot of detail and even with animations in my blog post series "This goes to 11". Some of the more convoluted pieces of optimized code+approach will be described in parts 4,5 or this series which I intend to release this weekend.

In general, this provides a 11x-ish speedup for the Int32 sorting in pure unsafe, intrinsics + AVX2 C# code. To be clear, this means that a 10M integer sort that usually takes around 800ms on most modern machines becomes a 70ish ms ordeal. I'll immediately cut to the chase and describe what I perceive to be the potential pain points of this approach as a default implementation inside of CoreCLR:

  • Relies on AVX2
    • Doesn't work out of the box with R2R images
    • Code duplication with the "normal" Scalar variant
  • Unsafe
  • Pins memory
  • Substantial code size blow up, even for a single type/version like this proposed Int32 version
    • This is surely one main subject for further debate as the trade-off between unrolling and code-size vs. speed-up is completely up for discussion. Any number between no unrolling and 12x unrolling will provide a different working point for the sorting mechanics/perf speed-up.
  • Relies on a 2KB lookup table per size. For Int32, (additional tables will be required for per word-size)
    • It is important to make note that the table is shared for all similarly sized types. e.g, int/uint/float can all share one lookup-table and so on...
  • Extending this to multiple types adds a lot of source code to maintain and complexity to reason about in the future.

On the other hand, there are also pros:

  • Runs 11x faster on Intel/AMD machines on almost any substantial input size.

The current code employs a hybrid approach, much like introspective sort does, that is combined of:

  • Sorting small (up to 128 elements) arrays with Bitonic Sort
    • Generated with T4 templates
  • Heap-Sorting for bad edge cases (just like IntroSort does today)
  • Dealing with partitions using vectorized sorting in two functions:
    • Anywhere between 112-152 elements using a special vectorized partitioning function
    • >= 152 elements with an uber-optimized and unrolled version
  • Pivot selection remains "median of 3" for this PR, although I have a research version that seems promising in that it selects better pivots at the same running time, which I am still toying with.

Here is a teaser for my results when running on a modern AMD 3950x machine although my blog posts do a much better job in providing the various results:
image

I obviously expect a lot of discussion about what, if any of this, should ever go into CoreCLR, and am happy to change my approach to provide some (even if lower) speed-up for all CoreCLR users rather than users of my niche library.
I really don't know where to start with a review like this, so... here goes?

/cc: @jkotas @stephentoub @tannergooding @adamsitnik @GrabYourPitchforks @benaadams

I should mention that there is still one (1) outstanding caveat within the Int32 sorting space that I need to take care of: this has to do with accepting memory/pointers that are not natively aligned to Int32 size (4 bytes). There is a single (substantial) optimization that would have to be skipped in the beginning of each partitioning call, but that is a rather minor issue to deal with in general, and would actually make reviewing the code+approach harder at this point, so I actively precluded myself from handling that caveat in this draft PR.

@tannergooding
Copy link
Member

@tannergooding tannergooding commented Mar 4, 2020

@damageboy, did you happen to collect any numbers on the performance of just using Vector128<T> rather than Vector256<T>? Glancing through the original paper that inspired you, they seem to indicate that the 256-bit version provides roughly a 5% increase but at the cost of double the lookup table size (but admittedly, they don't seem to go into more detail and it wasn't clear if the 5% they refer to was in reference to the 128-bit version they talk about at the very start of the paper).

{
fixed (int* startPtr = &Unsafe.As<T, int>(ref keys[0]))
{
var sorter = new VectorizedSort.VectorizedUnstableSortInt32(startPtr, keys.Length);
Copy link
Member

@tannergooding tannergooding Mar 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading through the paper and blog, it looks like the algorithms for different sizes are basically the same, just with different lookup tables and increment sizes. Is that correct?

Copy link
Contributor Author

@damageboy damageboy Mar 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that by different sizes you mean different types?

If so, then:

  • unsigned sorting requires two lookup-tables (at least according to my plan), per size, but one of those tables is the one used by signed sorting.
    • I can get into this when we want to dive into my "plans" for unsigned sorting.
  • float would share the table with int
  • and so one per type/size in bits, so:
    • Two more tables to support long/ulong
    • an so on per 16, 8 bits

Does that answer your question?

Copy link
Member

@tannergooding tannergooding Mar 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question was more around if the primary difference between int and short was just around the lookup table and not necessarily the algorithm used for sorting (other than the number of elements processes "per iteration")?

Trying to understand if supporting the other types would result in a code explosion or if most of the algorithm could be shared, just passing in different increment sizes and lookup tables (and calling a different instruction for the permutations).

Copy link
Contributor Author

@damageboy damageboy Mar 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, I think all signed + floating types would:

  • definitely share the exact same principle
  • could potentially share one managed implementation
  • Would definitely blow up in native code size
  • Require a table per size (bytes and shorts are kind of annoying, TBH, but we'll get into that later)

Unsigned sorting has its own complexity, so it would definitely share 1/2 of the lookup tables and the same concepts, but perhaps different managed code.

Unfortunately, a lot of the code ends up being type-dependent because of the comparison intrinsic. The same blowup "problem" is unfortunately also true for BitonicSort.
My "issue" with code-size considerations and the potential blow-up is that there is no clear "formula" to what is acceptable. How much perf increase is worth how much blow-up in code-size? This is part of where I hope this discussion will take us...

The reality is that there are a few dozen working points in terms of various algorithm selection (for small array sorting) and unroll size tweaks that could substantially decrease the native code-size, while still providing reasonable speed-up.
It would be very helpful to try to come up with a some sort of a formula for thinking about this...

It is also probably a good time to mention that, quite surprisingly, a future AVX512 dotnet enabled runtime would lead to completely removing the look-up tables as VCOMPRESS* basically kills the need for the permutation primitive + table.

@damageboy
Copy link
Contributor Author

@damageboy damageboy commented Mar 4, 2020

@damageboy, did you happen to collect any numbers on the performance of just using Vector128<T> rather than Vector256<T>? Glancing through the original paper that inspired you, they seem to indicate that the 256-bit version provides roughly a 5% increase but at the cost of double the lookup table size (but admittedly, they don't seem to go into more detail and it wasn't clear if the 5% they refer to was in reference to the 128-bit version they talk about at the very start of the paper).

I did not test Vector128<T> in any way, but that doesn't mean I don't have an opinion about this ;):
I think I have some stronger opinions/justification for using 256-bit operations for my "take" on partitioning and this is closely related to my motivation for unrolling by so much (8x).

This is a slightly complicated argument, so I'll do my best not to butcher it:
Unlike the original paper, I went for in-place sorting. This decision led me down to a path where my approach, while being memory efficient, was severely hampered by high rate of branch mis-predictions (I cover this towards the end of the 3rd blog-post). The branch mis-predictions stem from the fact that selecting a side from which to read from next in my in-place approach (what I call double pumping) is a data-dependent branch, and the branch (mis-)predictor is not having a good time with that branch.

My (only working) mitigation to this (I tried a few...), was to unroll the code. So really, through unrolling, I'm reducing the impact of what is an almost assured 50% (ish) misprediction rate.
Going back to why I think this is related to your question/comment: my approach relies on pumping as many elements through the unrolled block before needing to hit that nasty branch.
As such, partitioning more elements with less code becomes a relatively important consideration when choosing Vector128<T> vs. Vector256<T> for in-place sorting.
I could do the same with Vector128<T>, but I would end up hitting that branch, which is not part of the original paper, twice as often, when compared to using Vector256<T>.

In addition, I ended up "compressing" the look-up tables to 2KB by using ulong entries (down from their original 8KB) and I also have a slightly (1-2%) slower version, which unfortunately relies on the now cursed PDEP where the lookup table can go all the way down to 768 bytes on Intel using PDEP, or 1024 bytes on AMD (using a PDEP replacement).

@tannergooding
Copy link
Member

@tannergooding tannergooding commented Mar 4, 2020

Thanks for the explanation! If we don't have numbers, that is fine. We just won't be able to take it into consideration during review 😄

For reference, I was interested for a few reasons:

  • Vector128 is available even downlevel so the default AOT/R2R scenarios can benefit
    • Startup can be fast while still allowing for fast sorts
    • Devices without AVX support can still benefit (including both older devices and newer low power devices)
  • Vector128 is more "stable" across a range of devices
    • Some devices have 2x128-bit pipelines rather than a dedicated 256-bit pipeline
    • Some older devices could downclock under "heavy" 256-bit workloads
    • Unaligned loads/stores can be more impactful for 256-bit payloads
  • Vector128 looks to have a smaller lookup table
    • Would reduce the base image size
  • Vector256 is generally not a 2x increase over Vector128
    • It can often be nearly identical or within a small range

@damageboy
Copy link
Contributor Author

@damageboy damageboy commented Mar 4, 2020

Thanks for the explanation! If we don't have numbers, that is fine. We just won't be able to take it into consideration during review

For reference, I was interested for a few reasons:

  • Vector128 is available even downlevel so the default AOT/R2R scenarios can benefit

    • Startup can be fast while still allowing for fast sorts
    • Devices without AVX support can still benefit (including both older devices and newer low power devices)
  • Vector128 is more "stable" across a range of devices

    • Some devices have 2x128-bit pipelines rather than a dedicated 256-bit pipeline
    • Some older devices could downclock under "heavy" 256-bit workloads
    • Unaligned loads/stores can be more impactful for 256-bit payloads
  • Vector128 looks to have a smaller lookup table

    • Would reduce the base image size
  • Vector256 is generally not a 2x increase over Vector128

    • It can often be nearly identical or within a small range

Just to be clear, a lot of sweat went into ensuring there are no unaligned loads. Almost ever.
I agree with the rest of what you wrote...

@jkotas
Copy link
Member

@jkotas jkotas commented Mar 4, 2020

This is definitely nice use of the hardware intrinsics. Thank you for doing this!

It is a lot of code and I do not think that sorting Int32s is common enough to include this amount of code in the platform.

Would you consider starting a new NuGet package so that people who need fast sorting can opt-in into it?

There are other types of optimizations that can be used to speed up sorting of large dataset, e.g. running it on multiple threads at the same time, etc.

@damageboy
Copy link
Contributor Author

@damageboy damageboy commented Mar 4, 2020

This is definitely nice use of the hardware intrinsics. Thank you for doing this!

It is a lot of code and I do not think that sorting Int32s is common enough to include this amount of code in the platform.

Would you consider starting a new NuGet package so that people who need fast sorting can opt-in into it?

There are other types of optimizations that can be used to speed up sorting of large dataset, e.g. running it on multiple threads at the same time, etc.

This already outside of the runtime repo as a nuget package:
https://www.nuget.org/packages/VxSort/0.1.5

This isn’t about int32 per-se but more as a general approach to sorting and should be considered as a “template” for how per-type vectorized sorting could end up looking. Therefore this (or another more cut down version) would end up being generated into 10 multiple per-type implementations (for all native types). There is no native type this approach does not cover as a rule.

I’m perfectly fine with keeping it as is, evolving the code outside of the runtime repo, and would completely understand if you would think this is just “too much” for the runtime to absorb and maintain.

I intentionally submitted my most extreme implementation (in terms of code size blowup/speedup) as a starting point (and quite possibly an ending point ;) for discussion about what, if at all, is possible to incorporate into the runtime.

If the bar is set at 0 machine code size increase no matter what the perf bump is (which I presume is not the case...) I would simply close the issue and be done with it. If on the other hand there is some acceptable trade-off that could be formulated for just how much code size increase is acceptable for how much speedup I would love to a) hear about it b) see if I can wiggle my implementation into that constraint.

Regardless, the general template would remain as-is in many of the overall aspects such as memory pinning, use of unsafe, reliance on lookup tables, etc.

I hope this clarifies my intention with this draft PR. I definitely agree that covering only a single type would not be very appealing.

As for parallelization, I don’t see any contradictions between speeding up single threaded sorting by 11x (or likely around 25x with AVX512 in the future) and parallelizing using multiple cores.

@jkotas
Copy link
Member

@jkotas jkotas commented Mar 4, 2020

This already outside of the runtime repo as a nuget package https://www.nuget.org/packages/VxSort/0.1.5

Perfect!

If the bar is set at 0 machine code size increase no matter what the perf bump is

We do look at multiple factors when evaluating performance improvements that come with non-trivial code size increase:

  • How many apps will benefit from this improvement? I have looked at our telemetry and I do not see sorting of primitive types to be used enough to pay this much core for it. For example, an average app out there spends 1000+x more time in UTF8 encoding / decoding that in sorting of primitive types according to our telemetry. It makes us much more likely to accept code size increase for UTF8 encoding / decoding than for sorting of primitive types.
  • Does this functionality have to be part of platform, or can it be independent package? For example, the hardware intrinsics have to be part of the platform and so we have accepted the non-trivial code size increase that comes with them.

I think it would be hard to fit this type of change into a reasonable budget given the benefits it has for typical app. Having it in independent package seems to be the best option for me.

Maybe we can help with promoting your great work (Twitter, Channel 9, …)? Send me an email if you are interested – my email address is in my github profile.

@damageboy
Copy link
Contributor Author

@damageboy damageboy commented Mar 4, 2020

This already outside of the runtime repo as a nuget package https://www.nuget.org/packages/VxSort/0.1.5

Perfect!

If the bar is set at 0 machine code size increase no matter what the perf bump is

We do look at multiple factors when evaluating performance improvements that come with non-trivial code size increase:

  • How many apps will benefit from this improvement? I have looked at our telemetry and I do not see sorting of primitive types to be used enough to pay this much core for it. For example, an average app out there spends 1000+x more time in UTF8 encoding / decoding that in sorting of primitive types according to our telemetry. It makes us much more likely to accept code size increase for UTF8 encoding / decoding than for sorting of primitive types.
  • Does this functionality have to be part of platform, or can it be independent package? For example, the hardware intrinsics have to be part of the platform and so we have accepted the non-trivial code size increase that comes with them.

I think it would be hard to fit this type of change into a reasonable budget given the benefits it has for typical app. Having it in independent package seems to be the best option for me.

Maybe we can help with promoting your great work (Twitter, Channel 9, …)? Send me an email if you are interested – my email address is in my github profile.

Thanks for the input!
I'd like to clarify a couple of additional things, in case I'm not understanding your intentions:

  • Sorting functionality (both stable and unstable) is already part of the runtime. I'm not suggesting to add sorting, I'm trying to optimize an existing API to gain and provide an 11x-ish perf speedup for users sorting operations involving primitives.
    • Leaving Array.Sort as-is simply means that the default implementation that 99% of users will ever see is 11x times slower than what can be done (for now, in my external nuget). To me this seems like a pit of failure, so to speak: everyone who needs to sort would simply use what's inside of CorLib, missing out on a lot of potential perf. The discoverability of any nuget package here would be very problematic next to a built-in Array.Sort.
  • What needs to be considered, IMO, is both the:
    • native code-size blow up, per primitive type. Even today, the single introsort C# implementation is used to generate up to 10 different primitive versions in terms of machines code (for primitive key-only sorting, as an example)...
    • The complexity blow-up: Granted, on first glance, the current code in the PR appears as though I'm dropping 2K LoC to sort int32, but in reality, A good 1500 of them are generated tables+Bitonic-Sort and there are about 400 lines of REAL code to consider here. I would also argue that I could turn those 400 lines into 500, and support with said 500 LoC all the signed primitive and floating point types. (Unsigned support would probably cost another 50-100 LoC).
      I obviously have no "proof" of the very last claim I've just made, as I have only implemented int32 thus far, but it should be easy to see how only a very small part of the current PR is truly hand-written code, as opposed to tables + generated code (from the T4 template).

As for the code-size blow-up:
I've experimented quite a lot with different working points as part of my rather long journey to 11x in terms of:

  • The exact algorithm gets used for small array sorting
  • How much to unroll by.

I think a reasonable compromise could be found here. As an anecdotal example, around this September, I was unrolling by 6x unroll with a different small-sorting algorithm and had a 6-7x speed-up but generating ~1/3 of the native code size of what I've provided in this PR. In other words: YMMV.

I could obviously go back to my evil lair, do much more brute-force research, creating a 10x10 matrix of speed-ups/code-size sometime by 2022 :). But given some sort of firmer idea/guidance of the kind of budget you'd consider reasonable for this, I might be able to converge on something faster, or give-up more quickly :)

P.S.:
Thus far I've failed to mention that this isn't about only sorting primitives. The same concept could be used/extended for the Array.Sort(keys, items) variant too. This might change how useful this is perceived/measured through telemetry or in your mind.

@jkotas
Copy link
Member

@jkotas jkotas commented Mar 5, 2020

simply means that the default implementation that 99% of users will ever see is 11x times slower than what can be done

You can say the same thing about e.g. string.IndexOf(string). There are well-known algorithms that can be used to make it a lot faster for larger inputs compare to what the implementation does today. These algorithms come with non-trivial code size and complexity. The suggestions to improve string.IndexOf(string) with these algorithms would be likely rejected for similar reasons.

@damageboy
Copy link
Contributor Author

@damageboy damageboy commented Mar 5, 2020

simply means that the default implementation that 99% of users will ever see is 11x times slower than what can be done

You can say the same thing about e.g. string.IndexOf(string). There are well-known algorithms that can be used to make it a lot faster for larger inputs compare to what the implementation does today. These algorithms come with non-trivial code size and complexity. The suggestions to improve string.IndexOf(string) with these algorithms would be likely rejected for similar reasons.

Sure, but for string.IndexOf(string), much like for Array.Sort() there has to be some "function" for how much code size increase is acceptable per speed-up.

I don't expect to be living in a world without limitations, but would just like to get some sort of feel/guidance to what those limitations are.

As a very crude example, we could come with a fictional:

                             code-size-increase 
perf-worthiness =     -------------------------------
                             speed-up-ratio

as a silly ratio to describe some acceptance number of even curve (over input size as a parameter).

One such possible curve/number would 0, where no code-size-increase is ever allowed, no matter what the speed-up is. I am pretty sure this is not what you had in mind.

Another possible one be the number 1. This is probably also pretty extreme, to the other end of the spectrum, since it might allow increasing the code size by a factor of 11x to get an 11x perf bump :)

I'm also not claiming that there is a single number for all functions, but its been roughly 6 comments into this discussion and we are nowhere near to even understanding what that number might look like for Array.Sort.

Unless you are basically saying that the number you had in mind is ~ 0.
Such an answer would personally surprise me, but I would completely accept it without further discussion :)

P.S.:
One thing I admit that I also find super weird is that given the fact that AVX2 is not emitted in R2R images (at least right now), and that user / CorLib code would probably have to call Array.Sort about 30 (IIRC) times before triggering a re-compilation with the AVX2 code path baked in...
Wouldn't this kind of make the telemetry argument irrelevant?

I mean, in the end of the day if, someone bothered to call Array.Sort() 30 times and the proposed functionality provides at the very least a 2x speedup for array sizes >= 50 and roughly peeks, rather early, at 10k elements with an 11x speedup, I would think that having the AVX2 code path even JIT'd is a huge signal of the fact that for some unknown reason, this particular caller/process is genuinely keen on sorting this specific data-type for some reason, and perhaps giving them a considerable speed-up is not the most unpleasant thing to do?
Is there something else I am missing here?

@damageboy
Copy link
Contributor Author

@damageboy damageboy commented Mar 5, 2020

  • How many apps will benefit from this improvement? I have looked at our telemetry and I do not see sorting of primitive types to be used enough to pay this much core for it. For example, an average app out there spends 1000+x more time in UTF8 encoding / decoding that in sorting of primitive types according to our telemetry.

This is actually super interesting... Both for me, and I also think that for the community at large.
Would you be able to share the actual telemetry data and/or the source of this data?

Are we talking about stack traces from 1000s of cores running bing search or customer workloads?
which types of customers? What workloads? This could be super useful for figuring out where the platform needs to move forward...

Or alternatively, are we talking about VS-intellisense-in-the-cloud like data generated from what people type in their VS? This would, on the other hand, mean something completely different, as that would be possible that code like:

if (input.IndexOf('😕') ||
    input.IndexOf('😢') ||
    input.IndexOf('😞') ||
    input.IndexOf('😒')) { ... }

definitely makes it look like string.IndexOf() being called 4 times as frequently as:

Array.Sort(_oneHundredMillionIntegersMuhhahahaha);

@redknightlois
Copy link

@redknightlois redknightlois commented Mar 5, 2020

  • How many apps will benefit from this improvement? I have looked at our telemetry and I do not see sorting of primitive types to be used enough to pay this much core for it. For example, an average app out there spends 1000+x more time in UTF8 encoding / decoding that in sorting of primitive types according to our telemetry. It makes us much more likely to accept code size increase for UTF8 encoding / decoding than for sorting of primitive types.

Anecdotic I know, but I have implemented over my work time in .Net (since the early days of 1.1) at least 6 .Sort() implementations because Array.Sort() was consistently showing as a bad guy in my line of work. Last implementation: https://github.com/ravendb/lucenenet/blob/3.0.3-netcore/src/Lucene.Net/Util/Sorter.cs . It could be argued that telemetry is at risk of being biased because heavy sorting users (those that would benefit the most) had already moved to their own implementation. I would certainly use VxSort as my standalone when it supports all types (that I am sure it will), but if the blocker is framework operations that use Sort internally I wouldn't be so lucky.

@GSPP
Copy link

@GSPP GSPP commented Mar 5, 2020

Could this be combined with TimSort? TimSort is a "practical" sorting algorithm that takes advantage of patterns that are commonly found in real-world data. I believe, it mostly takes advantage of already sorted subranges.

@benaadams
Copy link
Member

@benaadams benaadams commented Mar 5, 2020

It could be argued that telemetry is at risk of being biased because heavy sorting users (those that would benefit the most) had already moved to their own implementation.

Is used heavily according to API Port; though doesn't say what data type T is:

image

@billwert
Copy link
Member

@billwert billwert commented Mar 5, 2020

API Port indicates it is referenced. It doesn't indicate it is hot, or even called, in those applications.

@jkotas
Copy link
Member

@jkotas jkotas commented Mar 5, 2020

Would you be able to share the actual telemetry data and/or the source of this data?

We have data about where Azure cloud services spend time inside .NET. This telemetry is enabled only for code that is owned by Microsoft. We take privacy and confidentiality very seriously. Collecting this data for customer workloads would not be acceptable.

It is aggregate over thousands of individual cloud services written by hundreds of teams running on many thousands of machines. You can think about it as attaching profiler to every machine for random 5 minutes each day. I am not able to share the raw data, but I can share answers to questions like what is relative time spent in one API vs. other API.

Many of the performance improvements are motivated or influenced by insights we get from this data. For example, the recent batch of major regular expression improvements was largely motivated and influenced by this data.

if the blocker is framework operations that use Sort internally I wouldn't be so lucky.

If you find these without a good workaround, we would like to know.

@redknightlois
Copy link

@redknightlois redknightlois commented Mar 5, 2020

If you find these without a good workaround, we would like to know.

Don't worry, you will ;) .. that doesn't change the fact that I would seriously want to avoid having to rebuild something as foundational as .Sort() any now and then, in the same vein that I thankfully don't have to rewrite memcpy() now as I used to. :D

@adamsitnik
Copy link
Member

@adamsitnik adamsitnik commented Mar 6, 2020

If the code size increase is too big, perhaps we should consider introducing a System.Math.Vector library with all the most popular operations vectorized?

As of today, there is plenty of libraries that implement it on their own, a good example is ML.NET: https://github.com/dotnet/machinelearning/tree/master/src/Microsoft.ML.CpuMath

If we had all of these in one place, our customers would no longer need to implement it on their own. This would decrease the number of bugs and improve the time to market (as an app|system developer you want to reuse existing blocks and put them together to solve the business problem, not reimplement well know algos on your own). It would be also a clear tradeoff for the users: you get improved performance for the price of the bigger size of your app.

@tannergooding @jkotas @danmosemsft what do you think?

@jkotas
Copy link
Member

@jkotas jkotas commented Mar 6, 2020

perhaps we should consider introducing a System.Math.Vector library with all the most popular operations vectorized?

Where do you set the bar that the operations are popular enough to be included in this library? As I have said, sorting of primitive types does not have enough use across .NET apps.

I believe that focused independent packages like VxSort are the best way to address the general need for alternative implementations with different trade-offs.

@vectorix
Copy link

@vectorix vectorix commented Mar 6, 2020

I guess one drawback of making this a separate package is that the benefits would not be available to built in collections which call Array.Sort internally: SortedSet, SortedList, List, ImmutableArray, ImmutableList, or ArraySortHelper for Span with its own sort.

Shouldn't a good linker do tree pruning and remove the sort implementation entirely from the output binary if it is not used by the user program?

On a more general note though, as interesting as this is, sorting by primitives (at least in my use cases) is not useful with in-place sorting. The more general use case is for indexing into another collection. If we would have Arrays of Structs (i.e. like DataFrame), perhaps we could use a fast vectorized sort to maintain e.g. ordering by useful primitives, such as int/long/float/doubleDateTime. So, rather than sorting values in-place, it would be preferable to instead have an output array that gives indices into the original array in a sorted order. This index array would be pre-allocated and probably overwritten on each re-sort. This would allow it to work with column-based storage, like in DataFrame where primitives are arranged in a vectorization friendly way.

Effectively, replacing/improving IntrospectiveSort in DataFrameColumn.cs:
https://github.com/dotnet/corefxlab/blob/ef94e1333ae83c7ed9ca8b72cce6b0557fadb93c/src/Microsoft.Data.Analysis/DataFrameColumn.cs

Perhaps this belongs inside Microsoft.Data.Analysis?

@jkotas
Copy link
Member

@jkotas jkotas commented Mar 6, 2020

Shouldn't a good linker do tree pruning and remove the sort implementation entirely from the output binary if it is not used by the user program?

Sorting is referenced by typical program so it cannot be removed. But typical program spends minuscule amount of time in it.

@redknightlois
Copy link

@redknightlois redknightlois commented Mar 6, 2020

Effectively, replacing/improving IntrospectiveSort in DataFrameColumn.cs:
https://github.com/dotnet/corefxlab/blob/ef94e1333ae83c7ed9ca8b72cce6b0557fadb93c/src/Microsoft.Data.Analysis/DataFrameColumn.cs

Perhaps this belongs inside Microsoft.Data.Analysis?

That was essentially my point when I pointed out that .Sort() data gathering will be biased toward the survivors, like the case of the planes in the WW2.

@jkotas
Copy link
Member

@jkotas jkotas commented Mar 6, 2020

FWIW, I have looked at everything with "sort" in the name, including unmanaged sort implementations like qsort from C-runtime, and filtering out false positives like "ProcessorThread". It did not add up to anything significant for primitive types even when all taken together.

Another interesting bit of data is that there are more cycles spent in ArrayHelper::TrySZSort on checking whether we have array of primitive types than there are cycles spent on the actual sorting of arrays of primitive types. Luckily, ArrayHelper::TrySZSort was removed in https://github.com/dotnet/coreclr/pull/27700/files#diff-a3d933514105f017e519bb98ade49903L268.

@nietras
Copy link
Contributor

@nietras nietras commented Mar 8, 2020

Great work by @damageboy indeed 👍. Related to this I have a question and concern I also voiced in damageboy/VxSort#12, is "unsafe" code acceptable in the sorting code?

When I tried getting my version of managed sort rewrite for Span in (in dotnet/corefx#26859), it was rejected due to security concerns, is that still valid? At that time I also had a goal to make sure there was 100% fidelity with previous Sort this also does not appear to be a concern? This relates to the order of equal keys, which may change. Hence, if there are items/values sorted together with keys, this will change output.

If it isn't at least at that time there were things that could be done to speed up Sort without increasing code size much. Nothing like 11x but for some cases 2x (this was on an older .NET Core and code quality improvements of the JIT may have changed that).

@jkotas
Copy link
Member

@jkotas jkotas commented Mar 8, 2020

it was rejected due to security concerns, is that still valid?

Yes, it is still valid. Somebody would need to do detailed security review of this and it would probably need additional security focused testing (fuzzing, etc.).

@tannergooding
Copy link
Member

@tannergooding tannergooding commented Mar 8, 2020

Commenting here due to the twitter thread: https://twitter.com/damageboy/status/1236660790112985090?s=20

The term "code size" has been thrown around a bit up above but my understanding is that this (and most other) PR(s) going in has little to do with the code size increase. Instead it has to do with the increased complexity of the code being submitted and therefore the increased maintenance cost on that code moving forward. That is, a PR which reduced the size of the code but made it more complex (such as using magic numbers or specialized tricks, etc) might likewise be rejected, even if it made it faster. Code size does have some impact, as bigger code can be (and is typically) harder to rationalize overall, but that isn't always the case.

We are then looking at how widely used a given method is and how frequently it is used in hot paths to determine whether the increased cost of the new code makes it "worthwhile" and that isn't easily quantified. Based on the current metrics we have, Array.Sort isn't super widely used and is likely not hitting many hot paths. Things like UTF8 or other string functions are likely significantly bigger bottlenecks. There may be some "survivorship" bias around Array.Sort, but that is also hard to quantify and would would require additional information/research/input to be rationalized.

In the terms of this PR:

  • The vectorized code isn't terribly complex to someone familiar with the hardware intrinsics, but to someone not familiar, it can basically look like magic. That is, if a bug happened in the algorithm, this isn't something that could be trivially marked as "up for grabs" and then reviewed by anyone on the team (at least not without effort and lengthening the review time).
  • The code is platform specific and would need to be reimplemented for ARM64 using their intrinsics, which increases the cost further.
  • The improvement is only available on AVX2 enabled hardware. This is applicable to most machines in the past 6-7 years, but would exclude older and low power devices as well as the default R2R scenario
  • The code is likely to increase in size, even if minimally, to support the other 9 primitive types (byte, sbyte, short, ushort, uint, long, ulong, float, and double). We may also need to consider char and/or others

So based on the metrics we have, it is difficult to justify the increased maintenance cost of the code due to the perceived limited usage of Array.Sort in perf critical scenarios and it is difficult to concretely specify what amount of change would be acceptable.

Is that about right @jkotas, @damageboy ?

@damageboy
Copy link
Contributor Author

@damageboy damageboy commented Mar 8, 2020

@tannergooding: I personally agree with everything you said in the comment above.

My only "objection" is to the "metrics" theme. I do find it very weird to talk about sorting numbers as something even requiring metrics.
To me, it's a bit like discussing if punctuation is really a thing in English based on reading a sample of 1000 twitter messages written by Donald Trump.
But then again, maybe it is just me.

@benaadams
Copy link
Member

@benaadams benaadams commented Mar 8, 2020

The improvement is only available on AVX2 enabled hardware. This is applicable to most machines in the past 6-7 years, but would exclude older and low power devices as well as the default R2R scenario

Could a form; in addition, be done for Sse (or whatever the R2R baseline is) and would that be desirable?

@tannergooding
Copy link
Member

@tannergooding tannergooding commented Mar 8, 2020

There may be some "survivorship" bias around Array.Sort, but that is also hard to quantify and would would require additional information/research/input to be rationalized.

Just looking at https://source.dot.net/#q=Sort, most of the core .NET Libraries go through Array.Sort, but once you get out of that other libraries (DataFrame, ML.NET, and even WPF) look to start rolling their own implementations. I didn't do any investigation as to whether these could use Array.Sort or as to why they are using their own, but that might be an interesting point of investigation.

@jkotas
Copy link
Member

@jkotas jkotas commented Mar 9, 2020

Is that about right

@tannergooding Yes, thank you for writing a great summary.

R2R

Among the different arguments, not being able to precompile AVX code with R2R today is less important to me. I expect that we are going to be able to do that eventually.

why they are using their own, but that might be an interesting point of investigation.

+1

ML.NET and DataFrame have its own copy of sort because of they need to sort Spans, but still need to target netstandard2.0. WPF has insertion sort optimized for small number of elements in C++.

@damageboy
Copy link
Contributor Author

@damageboy damageboy commented Mar 9, 2020

@tannergooding Thanks again for helping to calrify where everyone stands.

The only argument I would never even plan on having is about telling other people what complexity they should pile on their side on their repo, be it Microsoft or anyone else.

I think it would be helpful, in the future, to communicate more clearly about when the issue at hand is about code-size/bloat, which at least for me, and I suspect for many others, is/was a technical metric that can be discussed, measured and reasoned about and is not to be conflated with code-complexity which can often grow even as code-size is reduced...

I'll close the issue now, continue my work on the blog series and then on generalizing VxSort.
So Long, and Thanks for All the Fish.

@damageboy damageboy closed this Mar 9, 2020
@LukaszRozmej
Copy link

@LukaszRozmej LukaszRozmej commented Mar 10, 2020

This thread and comments from people more knowledgeable than me, made me think that best approach would be to have plug-in able sorting strategy in core libraries and it's collections. Then 3rd party sorting algorithms could be 1st class citizens.

@msftbot msftbot bot locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet