Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose RoundUpToPowerOf2 from BitOperations and optimize it #43135

Closed
Sergio0694 opened this issue Oct 7, 2020 · 17 comments · Fixed by #53992
Closed

Expose RoundUpToPowerOf2 from BitOperations and optimize it #43135

Sergio0694 opened this issue Oct 7, 2020 · 17 comments · Fixed by #53992
Labels
api-approved API was approved in API review, it can be implemented area-System.Numerics help wanted [up-for-grabs] Good issue for external contributors

Comments

@Sergio0694
Copy link
Contributor

Sergio0694 commented Oct 7, 2020

Background and Motivation

Rounding a number to the closest >= power of two is a relatively common operation to perform, especially for developers implementing custom collection type that require a power of two to be used in one of the internal data structures (for reference, CoreCLR uses this method in the ConcurrentQueue type (here). This proposal has two main points:

  • Expose this API from the BitOperations class
  • Optimize it using intrinsics

Proposed API

namespace System.Numerics
{
    public static class BitOperations
    {
+        public static uint RoundUpToPowerOf2(uint i);
+        public static ulong RoundUpToPowerOf2(ulong i);
    }
}

Usage Examples

I'm using this same method in a couple places in the Microsoft.Toolkit.HighPerformance package:

  • In the StringPool type, to initialize the internal buckets for the cached string-s (to get a fast % op)
  • In the ArrayPoolBufferWriter<T> type, to round up the requested size to ArrayPool<T> and avoid repeated new[] allocations

Within CoreCLR, there's also that ConcurrentQueue usage example I mentioned above.

Details

The implementation would include vectorized paths like LeadingZeroCount does, checking for Lzcnt, ArmBase and then X86Base, or alternatively it would use software fallback currently (always) used within ConcurrentQueue.

Notes

If the API proposal is approved I'd be happy to help out and make a PR for this 😄

@Sergio0694 Sergio0694 added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Oct 7, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Numerics untriaged New issue has not been triaged by the area owner labels Oct 7, 2020
@ghost
Copy link

ghost commented Oct 7, 2020

Tagging subscribers to this area: @tannergooding, @pgovind, @jeffhandley
See info in area-owners.md if you want to be subscribed.

@tannergooding tannergooding added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation untriaged New issue has not been triaged by the area owner labels Oct 12, 2020
@tannergooding tannergooding added this to the Future milestone Oct 12, 2020
@tannergooding tannergooding removed this from the Future milestone Jan 28, 2021
@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Jan 28, 2021
@terrajobst
Copy link
Member

terrajobst commented Jan 28, 2021

Video

  • Looks good as proposed
namespace System.Numerics
{
    public static class BitOperations
    {
        public static uint RoundUpToPowerOf2(uint i);
        public static ulong RoundUpToPowerOf2(ulong i);
    }
}

@tannergooding tannergooding added the help wanted [up-for-grabs] Good issue for external contributors label Jan 28, 2021
@stephentoub
Copy link
Member

What's the behavior of uint RoundUpToPowerOf2(uint i) if a value greater than (uint.MaxValue / 2) + 1 is passed in? Throw? Return 1?

@danmoseley
Copy link
Member

Optimize it using intrinsics

Is this already done?

if (Lzcnt.IsSupported || ArmBase.IsSupported || X86Base.IsSupported)
{
return 1u << (32 - LeadingZeroCount(value - 1));
}

@stephentoub
Copy link
Member

stephentoub commented Mar 29, 2021

Is this already done?

As of last week, yes, but there's now a new TODO:

internal static uint RoundUpToPowerOf2(uint value)
{
// TODO: https://github.com/dotnet/runtime/issues/43135
// When this is exposed publicly, decide on the behavior for the boundary cases...
// the accelerated and fallback paths differ.

@tannergooding
Copy link
Member

For reference, the main two concerns are 0 and overflow (int.MaxValue + 2 through uint.MaxValue).

Given 0 is not a power of two, it makes sense for this to return 1. This is what the lzcnt path returns, but the bithacks implementation returns 0 and needs an additional value += (value == 0) ? 1 : 0.

Likewise, the "logical" solution for overflow cases is to return 0, representing the lowest 32-bits of a theoretically infinitely precise result. This is what the bithacks version returns (unless modified to account for rounding up) but the lzcnt variant ends up shifting by 32 which causes it to return 1 instead.

It would be great if we can figure out trivial additions to get the expected result in both cases without significant overhead.

@tannergooding
Copy link
Member

Also noting I am slightly less concerned about overhead on the fallback path. Due to X86Base.BitScanReverse and ArmBase.LeadingZeroCount both being baseline instructions, the fallback path will only be hit for ARM32 (and we could handle that one explicitly since it isn't SIMD and isn't blocked for ARM32 due to needing LSRA changes).

@danmoseley
Copy link
Member

Likewise, the "logical" solution for overflow cases is to return 0

Is that the result that would be most useful for an algorithm (ie it would actually matter) and what other platforms do? Just curious.

@tannergooding
Copy link
Member

Is that the result that would be most useful for an algorithm (ie it would actually matter) and what other platforms do? Just curious.

The most useful to the algorithm. It allows 0 to be handled correctly and then allows you to correctly detect overflow.

Other platforms typically use this as an internal helper, most often using the "software fallback" case where they can just treat 0 as invalid and therefore overflow also becomes 0.

I would also be fine with saying 0 rounds to 0, rather than 1. I think that overflowing int.MaxValue + 2 to 1 is strictly incorrect behavior.

@VSadov
Copy link
Member

VSadov commented Mar 29, 2021

In all my uses of similar helpers I've never thought about boundary cases - what happens at 0 and Max. It was always outside of possible scenarios. Typical use is rounding up a buffer size so that & could be used when indexing instead of %.

I think if we could just use lzcnt, when available, it would be useful enough and faster.

@tannergooding
Copy link
Member

In all my uses of similar helpers I've never thought about boundary cases - what happens at 0 and Max. It was always outside of possible scenarios. Typical use is rounding up a buffer size so that & could be used when indexing instead of %.

I agree, but we still need to document the expected behavior here and ensure it is tested in case someone does pass it in. We've done this for all our publicly exposed cross-platform helpers and I believe we need to do the same here as well.

@VSadov
Copy link
Member

VSadov commented Mar 29, 2021

0 -> 1 and Max -> 0 seems right.

0 -> 0 would also be ok, but Max -> 1 seems odd

@VSadov
Copy link
Member

VSadov commented Mar 29, 2021

For lzcnt - would this work:

int shift = 32 - LeadingZeroCount(value - 1);
return (1u ^ (shift >> 5)) << shift;

not sure if it will be faster than a branch though, since the branch will be always the same way.

@huoyaoyuan
Copy link
Member

Did a benchmark:

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.19043.1052 (21H1/May2021Update)
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.100-preview.4.21255.9
[Host] : .NET 6.0.0 (6.0.21.25307), X64 RyuJIT
DefaultJob : .NET 6.0.0 (6.0.21.25307), X64 RyuJIT

Method value Mean Error StdDev Ratio RatioSD
NoBranch 0 0.2388 ns 0.0069 ns 0.0065 ns 1.00 0.00
Branch1 0 0.5208 ns 0.0168 ns 0.0158 ns 2.18 0.05
Branch2 0 0.3411 ns 0.0123 ns 0.0115 ns 1.43 0.05
NoBranch 2147483647 0.1684 ns 0.0154 ns 0.0144 ns 1.00 0.00
Branch1 2147483647 0.4595 ns 0.0117 ns 0.0104 ns 2.76 0.23
Branch2 2147483647 0.3494 ns 0.0112 ns 0.0105 ns 2.09 0.23
NoBranch 2147483648 0.1794 ns 0.0081 ns 0.0072 ns 1.00 0.00
Branch1 2147483648 0.4936 ns 0.0095 ns 0.0079 ns 2.75 0.12
Branch2 2147483648 0.3666 ns 0.0149 ns 0.0132 ns 2.05 0.12
NoBranch 4294967295 0.1887 ns 0.0123 ns 0.0109 ns 1.00 0.00
Branch1 4294967295 0.0343 ns 0.0143 ns 0.0134 ns 0.18 0.07
Branch2 4294967295 0.2294 ns 0.0193 ns 0.0180 ns 1.22 0.16
        public IEnumerable<object[]> Values()
        {
            yield return new object[] { 0u };
            yield return new object[] { 0x7FFF_FFFFu };
            yield return new object[] { 0x8000_0000u };
            yield return new object[] { 0xFFFF_FFFFu };
        }

        [Benchmark(Baseline = true), ArgumentsSource(nameof(Values))]
        public uint NoBranch(uint value)
        {
            int shift = 32 - BitOperations.LeadingZeroCount(value - 1);
            return (1u ^ (uint)(shift >> 5)) << shift;
        }

        [Benchmark, ArgumentsSource(nameof(Values))]
        public uint Branch1(uint value)
        {
            if (value > unchecked((uint)int.MinValue))
                return 0;
            return 1u << (32 - BitOperations.LeadingZeroCount(value - 1));
        }

        [Benchmark, ArgumentsSource(nameof(Values))]
        public uint Branch2(uint value)
        {
            if (value <= unchecked((uint)int.MinValue))
                return 1u << (32 - BitOperations.LeadingZeroCount(value - 1));
            return 0;
        }

The no branch approach wins for "regular" cases. It also wins in code size.

@huoyaoyuan
Copy link
Member

Ah, if we use 64bit shift operation, it can eliminate some corner cases.

@huoyaoyuan
Copy link
Member

return (uint)(0x1_0000_0000ul >> BitOperations.LeadingZeroCount(value - 1));

OK this definitely wins.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 10, 2021
@ghost ghost closed this as completed in #53992 Jun 11, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 11, 2021
@dotnet dotnet locked as resolved and limited conversation to collaborators Jul 11, 2021
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Numerics help wanted [up-for-grabs] Good issue for external contributors
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants