Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor improvement on the dataLength % 3. #114

Merged
merged 8 commits into from
Mar 17, 2020
Merged

Minor improvement on the dataLength % 3. #114

merged 8 commits into from
Mar 17, 2020

Conversation

ycrumeyrolle
Copy link
Contributor

Description

No change for x86, but save 3 mov instructions on x64 with the GetEncodedLength() method.
This is based on the PR dotnet/runtime#406
No such improvement that in the dotnet runtime as it is a modulo with a constant.

Benchmarks

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.16299.1625 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
Frequency=2062499 Hz, Resolution=484.8487 ns, Timer=TSC
.NET Core SDK=3.1.101
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  DefaultJob : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT

Method Value Mean Error StdDev Ratio
Mod3 64 2.175 ns 0.0231 ns 0.0216 ns 1.00
FastMod3 64 2.037 ns 0.0233 ns 0.0207 ns 0.94
Mod3 65 2.187 ns 0.0353 ns 0.0331 ns 1.00
FastMod3 65 2.019 ns 0.0165 ns 0.0146 ns 0.92
Mod3 66 2.006 ns 0.0093 ns 0.0087 ns 1.00
FastMod3 66 1.801 ns 0.0180 ns 0.0150 ns 0.90

No change for x86, but save 2 `mov` instructions on x64.
Copy link
Owner

@gfoidl gfoidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super 👍

Some nits. Otherwise LGTM.

@@ -231,8 +231,30 @@ private static int GetNumBase64PaddingCharsAddedByEncode(int dataLength)
// 0 -> 0
// 1 -> 2
// 2 -> 1
uint mod3 = FastMod3((uint)dataLength);
return mod3 == 0 ? 0 : (int)(3 - mod3);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if the JIT can emit cmov here, so it gets branchless. Unfortunately there is no "trick" to force the JIT to do so, except some very obscure hacks (which we shouldn't use here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious to know the obscure hacks...

Copy link
Owner

@gfoidl gfoidl Feb 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactely a cmov (which would be desired), but still a branchless version with some bit-twiddling and the knowledge that true / false is represented as 1 / 0 -> snippet in sharplab
It the stack isn't used, it would be even better.

The main advantage will be when the branch-predictor can't work reliable due to "random" inputs, i.e. unknown patterns for mod3 == 0. Then this approach may gain some perf.
Contrast this with clang, so it could be pretty concise branchless code.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting: if the bool is "casted" to byte instead of int the stack-usage goes away and code looks pretty good.

Can you give this a try with the above benchmark? If it has potential to be faster, we should go with such a approach (although state otherwise above 😉).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried with the BranchlessSelect casted as byte. The results are interesting:

Method Value Mean Error StdDev Median Ratio RatioSD
Mod3 64 2.607 ns 0.0991 ns 0.2890 ns 2.519 ns 1.00 0.00
FastMod3 64 2.208 ns 0.0491 ns 0.1351 ns 2.165 ns 0.85 0.11
FastMod3_Branchless 64 2.611 ns 0.0502 ns 0.0445 ns 2.606 ns 1.07 0.06
FastMod3_BranchlessReversed 64 2.299 ns 0.0066 ns 0.0051 ns 2.300 ns 0.94 0.05
Mod3 65 2.357 ns 0.0494 ns 0.0462 ns 2.350 ns 1.00 0.00
FastMod3 65 2.108 ns 0.0411 ns 0.0365 ns 2.094 ns 0.89 0.02
FastMod3_Branchless 65 2.622 ns 0.0626 ns 0.0615 ns 2.607 ns 1.11 0.04
FastMod3_BranchlessReversed 65 2.313 ns 0.0467 ns 0.0437 ns 2.298 ns 0.98 0.02
Mod3 66 2.157 ns 0.0346 ns 0.0289 ns 2.166 ns 1.00 0.00
FastMod3 66 1.920 ns 0.0396 ns 0.0471 ns 1.918 ns 0.89 0.02
FastMod3_Branchless 66 2.637 ns 0.0588 ns 0.0577 ns 2.631 ns 1.23 0.04
FastMod3_BranchlessReversed 66 2.334 ns 0.0617 ns 0.0606 ns 2.327 ns 1.09 0.03

The 'branchless' is not as efficient as expected .
When reverting the branch BranchlessSelect(mod3 != 0, 3 - mod3, 0) instead of BranchlessSelect(mod3 == 0, 0, 3 - mod3), the result are better but still under the branched one.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the numbers. 👍

Some thoughts -- but I don't know what matches here...

The effect of cmov et.al. depends on the distribution of taken / not taken branches, i.e. on how effecient the branch predictor can work. If the branch predictor does a good job, then the "branchy code" may be more efficient, as the pipeline can speculatively process the following steps, whereas with cmov the rest of the pipeline needs to wait for the result of cmov.

The code for the branchless version is quite a bit, so it may be more efficient just to use simple and compact code with a branch -- which can be predicted, as in the benchmarks where for constant input length the result will always be the same.
I'm pretty sure that there will be a benchmark constellation where the branchless version will be quite a bit faster.

In effect I think we should go with the code as is. It's easy and not hacky...and as the numbers show quite good. Furthermore I doubt a slight improvement here will show any effect on overall perf.

Once the runtime (JIT) will detect a pattern like return cond ? a : b and emit better code (maybe with tiering) we'll gain the perf-win for free.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More or less just for fun I tried a vectorized approach to make it branchless. I show it here because you are curios 😉

Benchmark is laid out so that (hopefully) the branch-predictor won't do a good job.

|      Method |      Mean |     Error |    StdDev | Ratio | RatioSD |
|------------ |----------:|----------:|----------:|------:|--------:|
|     Default | 1.0953 ns | 0.0213 ns | 0.0199 ns |  1.00 |    0.00 |
|  Branchless | 0.8785 ns | 0.0136 ns | 0.0127 ns |  0.80 |    0.02 |
|  Vectorized | 0.9657 ns | 0.0210 ns | 0.0186 ns |  0.88 |    0.02 |
| Vectorized1 | 0.8840 ns | 0.0156 ns | 0.0146 ns |  0.81 |    0.02 |
Benchmark-code
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using BenchmarkDotNet.Attributes;

#if !DEBUG
using BenchmarkDotNet.Running;
#endif

namespace ConsoleApp4
{
    class Program
    {
        static void Main(string[] args)
        {
            var bench = new Bench();

            int mod3 = 0;
            Console.WriteLine(bench.Default(mod3));
            Console.WriteLine(bench.Branchless(mod3));
            Console.WriteLine(bench.Vectorized(mod3));
            Console.WriteLine(bench.Vectorized1(mod3));
            Console.WriteLine();

            mod3 = 1;
            Console.WriteLine(bench.Default(mod3));
            Console.WriteLine(bench.Branchless(mod3));
            Console.WriteLine(bench.Vectorized(mod3));
            Console.WriteLine(bench.Vectorized1(mod3));

            Console.WriteLine();

            mod3 = 2;
            Console.WriteLine(bench.Default(mod3));
            Console.WriteLine(bench.Branchless(mod3));
            Console.WriteLine(bench.Vectorized(mod3));
            Console.WriteLine(bench.Vectorized1(mod3));

#if !DEBUG
            BenchmarkRunner.Run<Bench>();
#endif
        }
    }

    public class Bench
    {
        [Benchmark(Baseline = true, OperationsPerInvoke = 3)]
        public int Default()
        {
            int s = 0;

            for (int i = 0; i < 3; ++i)
            {
                s += this.Default(i);
            }

            return s;
        }

        [Benchmark(OperationsPerInvoke = 3)]
        public int Branchless()
        {
            int s = 0;

            for (int i = 0; i < 3; ++i)
            {
                s += this.Branchless(i);
            }

            return s;
        }

        [Benchmark(OperationsPerInvoke = 3)]
        public int Vectorized()
        {
            int s = 0;

            for (int i = 0; i < 3; ++i)
            {
                s += this.Vectorized(i);
            }

            return s;
        }

        [Benchmark(OperationsPerInvoke = 3)]
        public int Vectorized1()
        {
            int s = 0;

            for (int i = 0; i < 3; ++i)
            {
                s += this.Vectorized1(i);
            }

            return s;
        }

        public int Default(int mod3)    => mod3 == 0 ? 0 : 3 - mod3;
        public int Branchless(int mod3) => BranchlessSelect(mod3 != 0, 3 - mod3, 0);

        public int Vectorized(int mod3)
        {
            Vector128<int> vMod3       = Vector128.CreateScalarUnsafe(mod3);
            Vector128<int> zero        = Vector128<int>.Zero;
            Vector128<int> valueIfTrue = Vector128.CreateScalarUnsafe(3 - mod3);

            Vector128<int> eqZero  = Sse2.CompareEqual(vMod3, zero);
            Vector128<int> neqZero = Sse2.CompareEqual(eqZero, zero);
            Vector128<int> result  = Sse2.And(neqZero, valueIfTrue);

            return Sse2.ConvertToInt32(result);
        }

        public int Vectorized1(int mod3)
        {
            Vector128<int> vMod3       = Vector128.CreateScalarUnsafe(mod3);
            Vector128<int> zero        = Vector128<int>.Zero;
            Vector128<int> valueIfTrue = Vector128.CreateScalarUnsafe(3 - mod3);

            Vector128<int> eqZero = Sse2.CompareEqual(vMod3, zero);
            Vector128<int> result = Sse2.AndNot(eqZero, valueIfTrue);

            return Sse2.ConvertToInt32(result);
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private static int BranchlessSelect(bool condition, int valueIfTrue, int valueIfFalse)
        {
            return ((-Unsafe.As<bool, byte>(ref condition)) & (valueIfTrue - valueIfFalse)) + valueIfFalse;
        }
    }
}

Codegen

@gfoidl gfoidl added this to the v1.1.0 milestone Feb 18, 2020
@gfoidl gfoidl added the perf label Feb 18, 2020
ycrumeyrolle and others added 3 commits February 18, 2020 10:01
Use `Environment.Is64BitProcess` instead of sizeof(IntPtr) == 8 in unsafe context.
…r.Encode.cs

Co-Authored-By: Günther Foidl <gue@korporal.at>
Co-Authored-By: Günther Foidl <gue@korporal.at>
@gfoidl
Copy link
Owner

gfoidl commented Feb 18, 2020

Looks like CI has an assert failure in this test

public void SourceLength_is_negative___throws_ArgumentOutOfRange(int sourceLength)

@ycrumeyrolle
Copy link
Contributor Author

I will take a look at this issue

@gfoidl
Copy link
Owner

gfoidl commented Feb 18, 2020

The assert failure is caused by a test, which hits

The fix is quite simple...
Get base64EncodedLen first, so the exception is thrown before the assert is hit.
35ae9e0 (you can cherry pick or just copy, whatever is easier for you).

Copy link
Owner

@gfoidl gfoidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- if you don't have anything else for this PR I'll merge it later today.

@gfoidl gfoidl merged commit 63b2ab2 into gfoidl:master Mar 17, 2020
@gfoidl
Copy link
Owner

gfoidl commented Mar 17, 2020

@ycrumeyrolle thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants