Expose RoundUpToPowerOf2 from BitOperations and optimize it #43135

Sergio0694 · 2020-10-07T09:51:24Z

Background and Motivation

Rounding a number to the closest >= power of two is a relatively common operation to perform, especially for developers implementing custom collection type that require a power of two to be used in one of the internal data structures (for reference, CoreCLR uses this method in the ConcurrentQueue type (here). This proposal has two main points:

Expose this API from the BitOperations class
Optimize it using intrinsics

Proposed API

namespace System.Numerics
{
    public static class BitOperations
    {
+        public static uint RoundUpToPowerOf2(uint i);
+        public static ulong RoundUpToPowerOf2(ulong i);
    }
}

Usage Examples

I'm using this same method in a couple places in the Microsoft.Toolkit.HighPerformance package:

In the StringPool type, to initialize the internal buckets for the cached string-s (to get a fast % op)
In the ArrayPoolBufferWriter<T> type, to round up the requested size to ArrayPool<T> and avoid repeated new[] allocations

Within CoreCLR, there's also that ConcurrentQueue usage example I mentioned above.

Details

The implementation would include vectorized paths like LeadingZeroCount does, checking for Lzcnt, ArmBase and then X86Base, or alternatively it would use software fallback currently (always) used within ConcurrentQueue.

Notes

If the API proposal is approved I'd be happy to help out and make a PR for this 😄

The text was updated successfully, but these errors were encountered:

ghost · 2020-10-07T09:51:31Z

Tagging subscribers to this area: @tannergooding, @pgovind, @jeffhandley
See info in area-owners.md if you want to be subscribed.

terrajobst · 2021-01-28T18:34:51Z

Video

Looks good as proposed

namespace System.Numerics
{
    public static class BitOperations
    {
        public static uint RoundUpToPowerOf2(uint i);
        public static ulong RoundUpToPowerOf2(ulong i);
    }
}

stephentoub · 2021-01-28T21:06:16Z

What's the behavior of uint RoundUpToPowerOf2(uint i) if a value greater than (uint.MaxValue / 2) + 1 is passed in? Throw? Return 1?

danmoseley · 2021-03-29T17:23:48Z

Optimize it using intrinsics

Is this already done?

runtime/src/libraries/System.Private.CoreLib/src/System/Numerics/BitOperations.cs

Lines 83 to 86 in 21734a4

    
           if (Lzcnt.IsSupported || ArmBase.IsSupported || X86Base.IsSupported) 
        
           { 
        
               return 1u << (32 - LeadingZeroCount(value - 1)); 
        
           }

stephentoub · 2021-03-29T17:25:12Z

Is this already done?

As of last week, yes, but there's now a new TODO:

runtime/src/libraries/System.Private.CoreLib/src/System/Numerics/BitOperations.cs

Lines 76 to 80 in 473f8f6

    
           internal static uint RoundUpToPowerOf2(uint value) 
        
           { 
        
               // TODO: https://github.com/dotnet/runtime/issues/43135 
        
               // When this is exposed publicly, decide on the behavior for the boundary cases... 
        
               // the accelerated and fallback paths differ.

tannergooding · 2021-03-29T17:30:53Z

For reference, the main two concerns are 0 and overflow (int.MaxValue + 2 through uint.MaxValue).

Given 0 is not a power of two, it makes sense for this to return 1. This is what the lzcnt path returns, but the bithacks implementation returns 0 and needs an additional value += (value == 0) ? 1 : 0.

Likewise, the "logical" solution for overflow cases is to return 0, representing the lowest 32-bits of a theoretically infinitely precise result. This is what the bithacks version returns (unless modified to account for rounding up) but the lzcnt variant ends up shifting by 32 which causes it to return 1 instead.

It would be great if we can figure out trivial additions to get the expected result in both cases without significant overhead.

tannergooding · 2021-03-29T17:33:22Z

Also noting I am slightly less concerned about overhead on the fallback path. Due to X86Base.BitScanReverse and ArmBase.LeadingZeroCount both being baseline instructions, the fallback path will only be hit for ARM32 (and we could handle that one explicitly since it isn't SIMD and isn't blocked for ARM32 due to needing LSRA changes).

danmoseley · 2021-03-29T17:38:44Z

Likewise, the "logical" solution for overflow cases is to return 0

Is that the result that would be most useful for an algorithm (ie it would actually matter) and what other platforms do? Just curious.

tannergooding · 2021-03-29T17:48:31Z

Is that the result that would be most useful for an algorithm (ie it would actually matter) and what other platforms do? Just curious.

The most useful to the algorithm. It allows 0 to be handled correctly and then allows you to correctly detect overflow.

Other platforms typically use this as an internal helper, most often using the "software fallback" case where they can just treat 0 as invalid and therefore overflow also becomes 0.

I would also be fine with saying 0 rounds to 0, rather than 1. I think that overflowing int.MaxValue + 2 to 1 is strictly incorrect behavior.

VSadov · 2021-03-29T19:07:01Z

In all my uses of similar helpers I've never thought about boundary cases - what happens at 0 and Max. It was always outside of possible scenarios. Typical use is rounding up a buffer size so that & could be used when indexing instead of %.

I think if we could just use lzcnt, when available, it would be useful enough and faster.

tannergooding · 2021-03-29T19:10:49Z

In all my uses of similar helpers I've never thought about boundary cases - what happens at 0 and Max. It was always outside of possible scenarios. Typical use is rounding up a buffer size so that & could be used when indexing instead of %.

I agree, but we still need to document the expected behavior here and ensure it is tested in case someone does pass it in. We've done this for all our publicly exposed cross-platform helpers and I believe we need to do the same here as well.

VSadov · 2021-03-29T19:36:29Z

0 -> 1 and Max -> 0 seems right.

0 -> 0 would also be ok, but Max -> 1 seems odd

VSadov · 2021-03-29T19:42:40Z

For lzcnt - would this work:

int shift = 32 - LeadingZeroCount(value - 1);
return (1u ^ (shift >> 5)) << shift;

not sure if it will be faster than a branch though, since the branch will be always the same way.

huoyaoyuan · 2021-06-10T08:57:42Z

Did a benchmark:

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.19043.1052 (21H1/May2021Update)
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.100-preview.4.21255.9
[Host] : .NET 6.0.0 (6.0.21.25307), X64 RyuJIT
DefaultJob : .NET 6.0.0 (6.0.21.25307), X64 RyuJIT

Method	value	Mean	Error	StdDev	Ratio	RatioSD
NoBranch	0	0.2388 ns	0.0069 ns	0.0065 ns	1.00	0.00
Branch1	0	0.5208 ns	0.0168 ns	0.0158 ns	2.18	0.05
Branch2	0	0.3411 ns	0.0123 ns	0.0115 ns	1.43	0.05

NoBranch	2147483647	0.1684 ns	0.0154 ns	0.0144 ns	1.00	0.00
Branch1	2147483647	0.4595 ns	0.0117 ns	0.0104 ns	2.76	0.23
Branch2	2147483647	0.3494 ns	0.0112 ns	0.0105 ns	2.09	0.23

NoBranch	2147483648	0.1794 ns	0.0081 ns	0.0072 ns	1.00	0.00
Branch1	2147483648	0.4936 ns	0.0095 ns	0.0079 ns	2.75	0.12
Branch2	2147483648	0.3666 ns	0.0149 ns	0.0132 ns	2.05	0.12

NoBranch	4294967295	0.1887 ns	0.0123 ns	0.0109 ns	1.00	0.00
Branch1	4294967295	0.0343 ns	0.0143 ns	0.0134 ns	0.18	0.07
Branch2	4294967295	0.2294 ns	0.0193 ns	0.0180 ns	1.22	0.16

        public IEnumerable<object[]> Values()
        {
            yield return new object[] { 0u };
            yield return new object[] { 0x7FFF_FFFFu };
            yield return new object[] { 0x8000_0000u };
            yield return new object[] { 0xFFFF_FFFFu };
        }

        [Benchmark(Baseline = true), ArgumentsSource(nameof(Values))]
        public uint NoBranch(uint value)
        {
            int shift = 32 - BitOperations.LeadingZeroCount(value - 1);
            return (1u ^ (uint)(shift >> 5)) << shift;
        }

        [Benchmark, ArgumentsSource(nameof(Values))]
        public uint Branch1(uint value)
        {
            if (value > unchecked((uint)int.MinValue))
                return 0;
            return 1u << (32 - BitOperations.LeadingZeroCount(value - 1));
        }

        [Benchmark, ArgumentsSource(nameof(Values))]
        public uint Branch2(uint value)
        {
            if (value <= unchecked((uint)int.MinValue))
                return 1u << (32 - BitOperations.LeadingZeroCount(value - 1));
            return 0;
        }

The no branch approach wins for "regular" cases. It also wins in code size.

huoyaoyuan · 2021-06-10T09:03:24Z

Ah, if we use 64bit shift operation, it can eliminate some corner cases.

huoyaoyuan · 2021-06-10T09:09:48Z

return (uint)(0x1_0000_0000ul >> BitOperations.LeadingZeroCount(value - 1));

OK this definitely wins.

huoyaoyuan · 2021-06-10T10:04:33Z

Dasm for reference:

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABABgAJiBGAOgDkBXfGKASzFwG4BYAKD+IBmSgCZyAYT4BvPuTny5Q8g1YA7DOToQAQlGyqwACwAUK9eQBu2ADYMYASlkL5M3s/fk1G3IdYAzDQBeckExOHJtVgwAeQAHFmwMVghVXBoAGRhsABM1AHMALRYIcQgGdWMrWxhycKp7HjcPBWIAdnJjKgZyAD0Oswx7Yx9/DQA+MfIAVnt7cgAeefIRgMbmgF8nOS3nJQHNHT0DQ1RIjFMvSxs7Ryb3V2aFS5WgkLCIqLiEpJS0zJz8kUoCUyhUqnZauR6mtHop2hd1EMutYFksXg0dgpNndyJj5HtLlpdPojABlQxQBEacEOPFyB6wyjwgZDUgIKgAfVI3NIXJ5DBREw+MXieh+qQyWVyqkKxVK5XONMh9QxOKxfDplGE+2JxyoVKu1VuzQZsP8HSVk3KRhgYAA1jBssYqfYvDQALJqABq1wcxsZLXapBhsLaUO6iw6oUhZy+YuSEv+0tlwPlYN9ytmIec2M1BPMuqMIgNNP9901Tz8Foz82C1sMtodTpdbs9qh9RrLAaZ4dRUfesdFiQTfylgLloMVGbqWYr+KD2fVvHWQA===

Sergio0694 added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Oct 7, 2020

Dotnet-GitSync-Bot added area-System.Numerics untriaged New issue has not been triaged by the area owner labels Oct 7, 2020

tannergooding added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation untriaged New issue has not been triaged by the area owner labels Oct 12, 2020

tannergooding added this to the Future milestone Oct 12, 2020

tannergooding removed this from the Future milestone Jan 28, 2021

terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Jan 28, 2021

tannergooding added the help wanted [up-for-grabs] Good issue for external contributors label Jan 28, 2021

stephentoub mentioned this issue Mar 23, 2021

Move ConcurrentQueueSegment<T>.RoundUpToPowerOf2 to BitOperations #50070

Merged

gfoidl mentioned this issue Mar 29, 2021

Non-blocking ConcurrentDictionary #50337

Closed

4 tasks

huoyaoyuan mentioned this issue Jun 10, 2021

Expose RoundUpToPowerOf2 #53992

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 10, 2021

ghost closed this as completed in #53992 Jun 11, 2021

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 11, 2021

dotnet locked as resolved and limited conversation to collaborators Jul 11, 2021

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expose RoundUpToPowerOf2 from BitOperations and optimize it #43135

Expose RoundUpToPowerOf2 from BitOperations and optimize it #43135

Sergio0694 commented Oct 7, 2020 •

edited by tannergooding

ghost commented Oct 7, 2020

terrajobst commented Jan 28, 2021 •

edited by dotnet-api-review bot

stephentoub commented Jan 28, 2021

danmoseley commented Mar 29, 2021

stephentoub commented Mar 29, 2021 •

edited

tannergooding commented Mar 29, 2021

tannergooding commented Mar 29, 2021

danmoseley commented Mar 29, 2021

tannergooding commented Mar 29, 2021

VSadov commented Mar 29, 2021

tannergooding commented Mar 29, 2021

VSadov commented Mar 29, 2021

VSadov commented Mar 29, 2021 •

edited

huoyaoyuan commented Jun 10, 2021

huoyaoyuan commented Jun 10, 2021

huoyaoyuan commented Jun 10, 2021

huoyaoyuan commented Jun 10, 2021

Expose RoundUpToPowerOf2 from BitOperations and optimize it #43135

Expose RoundUpToPowerOf2 from BitOperations and optimize it #43135

Comments

Sergio0694 commented Oct 7, 2020 • edited by tannergooding

Background and Motivation

Proposed API

Usage Examples

Details

Notes

ghost commented Oct 7, 2020

terrajobst commented Jan 28, 2021 • edited by dotnet-api-review bot

stephentoub commented Jan 28, 2021

danmoseley commented Mar 29, 2021

stephentoub commented Mar 29, 2021 • edited

tannergooding commented Mar 29, 2021

tannergooding commented Mar 29, 2021

danmoseley commented Mar 29, 2021

tannergooding commented Mar 29, 2021

VSadov commented Mar 29, 2021

tannergooding commented Mar 29, 2021

VSadov commented Mar 29, 2021

VSadov commented Mar 29, 2021 • edited

huoyaoyuan commented Jun 10, 2021

huoyaoyuan commented Jun 10, 2021

huoyaoyuan commented Jun 10, 2021

huoyaoyuan commented Jun 10, 2021

Sergio0694 commented Oct 7, 2020 •

edited by tannergooding

terrajobst commented Jan 28, 2021 •

edited by dotnet-api-review bot

stephentoub commented Mar 29, 2021 •

edited

VSadov commented Mar 29, 2021 •

edited