BitOperations.IsPow2 for all supported integral types #36163

Merged on Feb 8, 2021 (8 commits)

john-h-k
Contributor

@john-h-k john-h-k commented May 9, 2020

Fixes #31297

Currently writing tests locally

@Dotnet-GitSync-Bot
Collaborator

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@john-h-k john-h-k reopened this Jun 12, 2020
@john-h-k john-h-k marked this pull request as ready for review June 12, 2020 21:58
@john-h-k john-h-k changed the title from "Draft BitOperations.IsPow2 for various integral types" to "BitOperations.IsPow2 for all supported integral types" Jun 12, 2020
{
// 1 set bit means number is a power of 2 unless it is the sign bit, so get rid of that
// Cast to unsigned type so logic shift not signed arithmetic shift
return PopCount((uint)value << 1) == 1;

(int)0x80000001 is not a power of 2, so you can't just discard the sign bit.
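
To make the objection concrete, here is a minimal sketch showing that shifting out the sign bit makes 0x80000001 look like a power of two. The helper name IsPow2ShiftedPopCount is hypothetical and only mirrors the expression under review; it assumes System.Numerics.BitOperations.PopCount is available.

```csharp
using System;
using System.Numerics;

// Hypothetical helper mirroring the expression under review.
static bool IsPow2ShiftedPopCount(int value) =>
    BitOperations.PopCount((uint)value << 1) == 1;

int bad = unchecked((int)0x80000001); // sign bit plus bit 0 set

// The shift discards the sign bit, leaving 0x00000002 with exactly one bit set.
Console.WriteLine(IsPow2ShiftedPopCount(bad)); // True, although the value is negative
Console.WriteLine(IsPow2ShiftedPopCount(4));   // True, as expected
```

Any negative value whose remaining 31 bits contain a single set bit is misreported in the same way.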

Member

Good point, it doesn't match software fallback behavior (add more test cases)

It actually matches the bug in the SoftwareFallback((uint)value << 1) function call. 😉

Member

ah, indeed, I didn't notice << 1

Contributor Author

@john-h-k john-h-k Jun 15, 2020

Oops sorry - will add a test for that and fix the fallback. I think the best way to do it in the accelerated route is

uint unsigned = (uint)value;
uint withoutSign = unsigned << 1;
uint mask = unsigned >> 31;
return (PopCount(withoutSign) & ~mask) == 1;

which is probably cheaper than a branch

I meant, if you are using >> 31 and & ~mask to ensure that IsPow2(value) returns false when value < 0, then it does not matter what PopCount returns for those values, so the << 1 is not necessary.

Is the code with PopCount actually faster than the software fallback, in which the value >= 0 check can be combined with the value != 0 check that is necessary anyway?

Contributor Author

I'm on Skylake, where popcount isn't as fast, and it seems branchless popcount is still generally faster than branch + popcount or the software fallback. The difference should be even bigger on Ryzen, where popcount is faster.

Member

Could you provide some numbers and the benchmark you are running?

Member

Why isn't this just (value > 0) && IsPow2((uint)value)?

Negative values generally aren't considered to be powers of 2, since you can't raise either 2 or -2 to n to get them.

Member

IsPow2((uint)value) almost works, except that -2147483648 would return true.
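
A short sketch of why the sign check is still needed once the value is reinterpreted as unsigned. The helper names are hypothetical, and PopCount is used for the unsigned check only for brevity, not because the thread settled on it.

```csharp
using System;
using System.Numerics;

// Unsigned check: exactly one bit set.
static bool IsPow2Unsigned(uint value) => BitOperations.PopCount(value) == 1;

// Signed check as suggested above: reject non-positive inputs first.
static bool IsPow2Signed(int value) => value > 0 && IsPow2Unsigned((uint)value);

Console.WriteLine(IsPow2Unsigned(unchecked((uint)int.MinValue))); // True: 0x80000000 is 2^31 as a uint
Console.WriteLine(IsPow2Signed(int.MinValue));                    // False: negative as an int
Console.WriteLine(IsPow2Signed(1024));                            // True
```

The value > 0 test covers both zero and every negative input, so the unsigned helper never sees the int.MinValue bit pattern.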

@danmoseley
Member

@john-h-k are you able to provide the benchmark numbers requested? thanks for the contribution

@ghost

ghost commented Oct 20, 2020

Tagging subscribers to this area: @tannergooding, @pgovind, @jeffhandley
See info in area-owners.md if you want to be subscribed.

@jeffhandley
Member

@tannergooding @pgovind Is this something we could get perf numbers for ourselves?

@tannergooding
Member

From a discussion on discord, @john-h-k reported the following using https://gist.github.com/john-h-k/f69f3017143e746c5f6e9f904a32767c:

| Method | Data | Type | Mean | Error | StdDev |
|---|---|---|---|---|---|
| IsPow2_Branch_Popcnt_Random | | Zero | 803.3 ns | 1.20 ns | 1.12 ns |
| IsPow2_Branch_Software_Random | | Zero | 788.0 ns | 0.55 ns | 0.52 ns |
| IsPow2_BranchInverted_Software_Random | | Zero | 787.9 ns | 0.75 ns | 0.67 ns |
| IsPow2_Branchless_Popcnt_Random | | Zero | 824.9 ns | 1.68 ns | 1.57 ns |
| IsPow2_Branch_Popcnt_Random | | Random | 3,380.9 ns | 3.71 ns | 3.47 ns |
| IsPow2_Branch_Software_Random | | Random | 935.6 ns | 7.80 ns | 7.30 ns |
| IsPow2_BranchInverted_Software_Random | | Random | 3,432.1 ns | 4.92 ns | 4.60 ns |
| IsPow2_Branchless_Popcnt_Random | | Random | 835.3 ns | 1.04 ns | 0.98 ns |
| IsPow2_Branch_Popcnt_Random | | UniformPow2 | 801.7 ns | 0.93 ns | 0.77 ns |
| IsPow2_Branch_Software_Random | | UniformPow2 | 788.0 ns | 0.95 ns | 0.84 ns |
| IsPow2_BranchInverted_Software_Random | | UniformPow2 | 790.6 ns | 1.31 ns | 1.23 ns |
| IsPow2_Branchless_Popcnt_Random | | UniformPow2 | 824.8 ns | 0.37 ns | 0.34 ns |
| IsPow2_Branch_Popcnt_Random | | UniformNonPow2 | 802.6 ns | 1.39 ns | 1.30 ns |
| IsPow2_Branch_Software_Random | | UniformNonPow2 | 788.1 ns | 1.31 ns | 1.16 ns |
| IsPow2_BranchInverted_Software_Random | | UniformNonPow2 | 788.1 ns | 0.58 ns | 0.52 ns |
| IsPow2_Branchless_Popcnt_Random | | UniformNonPow2 | 835.9 ns | 1.42 ns | 1.33 ns |
| IsPow2_Branch_Popcnt_Random | | UniformNegative | 533.8 ns | 0.81 ns | 0.76 ns |
| IsPow2_Branch_Software_Random | | UniformNegative | 787.4 ns | 1.54 ns | 1.44 ns |
| IsPow2_BranchInverted_Software_Random | | UniformNegative | 533.6 ns | 0.49 ns | 0.43 ns |
| IsPow2_Branchless_Popcnt_Random | | UniformNegative | 823.6 ns | 0.46 ns | 0.36 ns |
| IsPow2_Branch_Popcnt_Random | | Alternating | 801.0 ns | 0.35 ns | 0.31 ns |
| IsPow2_Branch_Software_Random | | Alternating | 809.5 ns | 3.08 ns | 2.89 ns |
| IsPow2_BranchInverted_Software_Random | | Alternating | 787.7 ns | 1.10 ns | 0.92 ns |
| IsPow2_Branchless_Popcnt_Random | | Alternating | 826.5 ns | 4.40 ns | 4.12 ns |
| IsPow2_Branch_Popcnt_Random | | AlternatingSign | 801.1 ns | 4.50 ns | 4.21 ns |
| IsPow2_Branch_Software_Random | | AlternatingSign | 809.1 ns | 3.48 ns | 3.26 ns |
| IsPow2_BranchInverted_Software_Random | | AlternatingSign | 536.7 ns | 0.98 ns | 0.87 ns |
| IsPow2_Branchless_Popcnt_Random | | AlternatingSign | 833.1 ns | 0.56 ns | 0.50 ns |

@tannergooding
Member

My own box reports:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET Core SDK=6.0.100-alpha.1.21056.1
  [Host]     : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT
  DefaultJob : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT

| Method | Type | Mean | Error | StdDev | Median |
|---|---|---|---|---|---|
| IsPow2_Branch_Popcnt_Random | Zero | 521.3 ns | 1.14 ns | 1.07 ns | 521.4 ns |
| IsPow2_Branch_Software_Random | Zero | 523.2 ns | 0.68 ns | 0.60 ns | 523.2 ns |
| IsPow2_BranchInverted_Software_Random | Zero | 657.8 ns | 1.62 ns | 1.26 ns | 657.8 ns |
| IsPow2_Branchless_Popcnt_Random | Zero | 707.1 ns | 5.45 ns | 5.09 ns | 707.5 ns |
| IsPow2_Branch_Popcnt_Random | Random | 874.7 ns | 10.33 ns | 9.66 ns | 870.2 ns |
| IsPow2_Branch_Software_Random | Random | 657.0 ns | 2.20 ns | 2.06 ns | 656.6 ns |
| IsPow2_BranchInverted_Software_Random | Random | 853.5 ns | 3.77 ns | 3.15 ns | 853.0 ns |
| IsPow2_Branchless_Popcnt_Random | Random | 703.6 ns | 4.36 ns | 4.08 ns | 703.0 ns |
| IsPow2_Branch_Popcnt_Random | UniformPow2 | 666.2 ns | 6.06 ns | 5.38 ns | 664.4 ns |
| IsPow2_Branch_Software_Random | UniformPow2 | 663.8 ns | 3.51 ns | 3.28 ns | 664.2 ns |
| IsPow2_BranchInverted_Software_Random | UniformPow2 | 664.7 ns | 3.10 ns | 2.59 ns | 664.5 ns |
| IsPow2_Branchless_Popcnt_Random | UniformPow2 | 722.7 ns | 6.06 ns | 5.67 ns | 719.2 ns |
| IsPow2_Branch_Popcnt_Random | UniformNonPow2 | 660.9 ns | 2.43 ns | 2.15 ns | 661.1 ns |
| IsPow2_Branch_Software_Random | UniformNonPow2 | 450.1 ns | 7.37 ns | 6.89 ns | 446.3 ns |
| IsPow2_BranchInverted_Software_Random | UniformNonPow2 | 537.0 ns | 3.02 ns | 2.52 ns | 537.7 ns |
| IsPow2_Branchless_Popcnt_Random | UniformNonPow2 | 722.7 ns | 5.85 ns | 5.47 ns | 719.3 ns |
| IsPow2_Branch_Popcnt_Random | UniformNegative | 661.6 ns | 1.13 ns | 1.00 ns | 661.7 ns |
| IsPow2_Branch_Software_Random | UniformNegative | 450.0 ns | 8.97 ns | 11.34 ns | 442.6 ns |
| IsPow2_BranchInverted_Software_Random | UniformNegative | 443.0 ns | 2.68 ns | 2.50 ns | 443.3 ns |
| IsPow2_Branchless_Popcnt_Random | UniformNegative | 705.2 ns | 3.37 ns | 3.16 ns | 704.9 ns |
| IsPow2_Branch_Popcnt_Random | Alternating | 526.5 ns | 3.16 ns | 2.96 ns | 526.2 ns |
| IsPow2_Branch_Software_Random | Alternating | 444.3 ns | 3.50 ns | 2.92 ns | 444.4 ns |
| IsPow2_BranchInverted_Software_Random | Alternating | 534.5 ns | 2.92 ns | 2.73 ns | 533.8 ns |
| IsPow2_Branchless_Popcnt_Random | Alternating | 702.0 ns | 3.15 ns | 2.94 ns | 702.3 ns |
| IsPow2_Branch_Popcnt_Random | AlternatingSign | 445.6 ns | 4.03 ns | 3.77 ns | 445.6 ns |
| IsPow2_Branch_Software_Random | AlternatingSign | 446.6 ns | 5.15 ns | 4.57 ns | 445.1 ns |
| IsPow2_BranchInverted_Software_Random | AlternatingSign | 443.7 ns | 2.74 ns | 2.57 ns | 443.7 ns |
| IsPow2_Branchless_Popcnt_Random | AlternatingSign | 702.9 ns | 4.49 ns | 4.20 ns | 703.4 ns |

@tannergooding
Member

Given the above, it looks like the right approach is to go with IsPow2_Branch_Software_Random.

The difference between using popcount and not is minimal on modern AMD and Intel. However, on the older machines it looks to be significantly slower for random inputs and it is likely not worth taking this hit.

@AntonLapounov
Member

AntonLapounov commented Jan 20, 2021

I think for ints we should use one of the two variants below. The second variant uses only a single branch when inlined; however, it would require the x & -x → blsi ·,x JIT optimization. C++ compiler output is provided for reference; the JIT is not that smart yet.

    // C++ compiler output
    //  x64                         ARM64
    //  -------------------------------------------------
    //  blsr    eax, ecx            sub     w1, w0, #1
    //  sete    dl                  tst     w1, w0
    //  test    ecx, ecx            ccmp    w0, 0, 4, eq
    //  setg    al                  cset    w0, gt
    //  and     al, dl              ret
    //  ret
    static bool IsPow2A(int x)
    {
        return (x & (x - 1)) == 0 && x > 0;
    }

    // C++ compiler output
    //  x64                         ARM64
    //  -------------------------------------------------
    //  blsi    eax, ecx            neg     w1, w0
    //  shr     ecx, 1              and     w1, w1, w0
    //  cmp     eax, ecx            cmp     w1, w0, lsr 1
    //  setg    al                  cset    w0, gt
    //  ret                         ret
    static bool IsPow2B(int x)
    {
        // Unsigned shift, signed comparison
        return (x & -x) > (int)((uint)x >> 1);
    }
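
As a sanity check on the two variants, the sketch below compares both against a deliberately naive reference on a handful of edge cases. IsPow2Reference and the sample list are assumptions added for illustration; the code relies on C#'s default unchecked arithmetic for x - 1 and -x at int.MinValue.

```csharp
using System;

static bool IsPow2A(int x) => (x & (x - 1)) == 0 && x > 0;

// Unsigned shift, signed comparison
static bool IsPow2B(int x) => (x & -x) > (int)((uint)x >> 1);

// Naive reference: compare against every positive power of two that fits in an int.
static bool IsPow2Reference(int x)
{
    for (int shift = 0; shift < 31; shift++)
    {
        if (x == 1 << shift) return true;
    }
    return false;
}

int[] samples = { int.MinValue, int.MinValue + 1, -2, -1, 0, 1, 2, 3, 6, 1 << 30, int.MaxValue };
foreach (int x in samples)
{
    Console.WriteLine($"{x}: A={IsPow2A(x)}, B={IsPow2B(x)}, expected={IsPow2Reference(x)}");
}
```

Both variants agree with the reference on these inputs, including zero and int.MinValue.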

/// <param name="value">The value.</param>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
[CLSCompliant(false)]
public static bool IsPow2(uint value)
Member

Can you also update:


and

to use IsPow2?

Member

We may want to use just if ((value & (value - 1)) != 0) without checking for 0. Otherwise Log2Ceiling(0) would return 1, which is greater than Log2Ceiling(1).
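
The ordering problem can be illustrated with a sketch. Both functions below are hypothetical stand-ins for the internal Log2Ceiling helper, not the runtime's actual code; they assume BitOperations.Log2, which returns 0 for an input of 0.

```csharp
using System;
using System.Numerics;

// Hypothetical variant that treats 0 as "not a power of two" (as IsPow2 would).
static int Log2CeilingWithZeroCheck(uint value)
{
    int result = BitOperations.Log2(value); // Log2(0) returns 0 by convention
    bool isPow2 = (value & (value - 1)) == 0 && value != 0;
    return isPow2 ? result : result + 1;
}

// Hypothetical variant using only the bit test, without the zero check.
static int Log2CeilingWithoutZeroCheck(uint value)
{
    int result = BitOperations.Log2(value);
    return (value & (value - 1)) != 0 ? result + 1 : result;
}

Console.WriteLine(Log2CeilingWithZeroCheck(0));    // 1, which exceeds the result for 1
Console.WriteLine(Log2CeilingWithZeroCheck(1));    // 0
Console.WriteLine(Log2CeilingWithoutZeroCheck(0)); // 0
Console.WriteLine(Log2CeilingWithoutZeroCheck(5)); // 3
```

Without the zero check the result is non-decreasing across 0 and 1, which is the property the comment is concerned with.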

Member

Log2Ceiling(0) doesn't matter right now because no code will do that. If Log2Ceiling is ever made public, erroneous inputs will need to be considered then.

Member

@AntonLapounov AntonLapounov Jan 28, 2021

Not only would it be correct, it should be faster as well. Ideally (value & (value - 1)) should be compiled into a single blsr instruction on x64.

Member

@stephentoub stephentoub Feb 1, 2021

Whatever implementation is chosen for IsPow2, it doesn't change the fact that these call sites should use IsPow2. That's what my initial comment here is about.

[CLSCompliant(false)]
public static bool IsPow2(uint value)
{
if (Popcnt.IsSupported)
Member

PopCount also has a path that special-cases AdvSimd.Arm64.IsSupported. Is that not worth checking for here?

Member

My current stance is that there look to be some open questions around popcnt perf, particularly with regard to random inputs (see #36163 (comment)), so it might be better to not use Popcnt until we have more information about what's causing this.

That is, is it due to popcnt or is it due to bad branch prediction?
If it's the latter, then what is the "optimal" ordering based on expected inputs?

I think that these can largely be answered in a follow-up PR, given that the perf difference, even in the winning cases, is fairly minor.

When we do get to looking at that, I'd expect that IsPow2 is largely used with validation and with positive inputs for the int case, in which case doing the "popcnt" check before the "edge case check" would be better.
This is because the "popcnt" check will cover everything except int.MinValue (0x8000_0000).
This means that if the assumption about inputs is correct, the "expected" case always does two compares; when the assumption is wrong, the first check exits early in more cases than the "edge case check" would.
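
The ordering described above could be sketched as follows. The name IsPow2PopCountFirst is hypothetical, and this is only one reading of "popcnt check before edge case check", not code from the PR.

```csharp
using System;
using System.Numerics;

// Popcount test first; the only single-bit value it lets through incorrectly
// is int.MinValue (0x8000_0000), which the second test rejects.
static bool IsPow2PopCountFirst(int value) =>
    BitOperations.PopCount((uint)value) == 1 && value != int.MinValue;

Console.WriteLine(IsPow2PopCountFirst(64));           // True: both checks run
Console.WriteLine(IsPow2PopCountFirst(0));            // False: popcount is 0, early exit
Console.WriteLine(IsPow2PopCountFirst(-3));           // False: more than one bit set, early exit
Console.WriteLine(IsPow2PopCountFirst(int.MinValue)); // False: rejected by the edge-case check
```

Zero and every negative value other than int.MinValue fail the popcount test, so the edge-case comparison only runs for inputs with exactly one set bit.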

Member

Note that Aarch64 has no POPCNT instruction for general-purpose registers. It would normally take 4 instructions: fmov, cnt, addv, fmov, where the first fmov instruction copies the value from a general-purpose register to a SIMD register and the last fmov instruction copies the calculated population count back to a general-purpose register for a comparison.

@tannergooding Should you run tests again, please try the IsPow2B variation I mentioned above.

@@ -53,30 +53,30 @@ static class BitOperations
 /// </summary>
 /// <param name="value">The value.</param>
 [MethodImpl(MethodImplOptions.AggressiveInlining)]
-public static bool IsPow2(int value) => value >= 0 && (value & (value - 1)) == 0;
+public static bool IsPow2(int value) => (value & (value - 1)) == 0 && value >= 0;
Member

I think this needs to be && (value > 0) because ((value & (value - 1)) == 0) will be true for 0 (hence why we have && (value != 0) for the unsigned variant)
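
A minimal sketch of the difference, using hypothetical helper names for the two comparisons:

```csharp
using System;

// Mirrors the line under review: zero passes both tests.
static bool IsPow2GreaterOrEqual(int value) => (value & (value - 1)) == 0 && value >= 0;

// Suggested fix: require a strictly positive value.
static bool IsPow2StrictlyPositive(int value) => (value & (value - 1)) == 0 && value > 0;

Console.WriteLine(IsPow2GreaterOrEqual(0));   // True: 0 & -1 == 0 and 0 >= 0
Console.WriteLine(IsPow2StrictlyPositive(0)); // False
Console.WriteLine(IsPow2StrictlyPositive(8)); // True
```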

@ghost

ghost commented Feb 8, 2021

Hello @tannergooding!

Because this pull request has the auto-merge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

@ghost ghost merged commit 7ac05af into dotnet:master Feb 8, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Mar 10, 2021
This pull request was closed.
Successfully merging this pull request may close these issues.

Consider introducing BitOperations.IsPow2 or Math.IsPow2