Minor improvement on the `dataLength % 3`. #114

ycrumeyrolle · 2020-02-17T22:41:27Z

Description

No change for x86, but save 3 mov instructions on x64 with the GetEncodedLength() method.
This is based on the PR dotnet/runtime#406
No such improvement that in the dotnet runtime as it is a modulo with a constant.

Benchmarks

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.16299.1625 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
Frequency=2062499 Hz, Resolution=484.8487 ns, Timer=TSC
.NET Core SDK=3.1.101
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  DefaultJob : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT

Method	Value	Mean	Error	StdDev	Ratio
Mod3	64	2.175 ns	0.0231 ns	0.0216 ns	1.00
FastMod3	64	2.037 ns	0.0233 ns	0.0207 ns	0.94

Mod3	65	2.187 ns	0.0353 ns	0.0331 ns	1.00
FastMod3	65	2.019 ns	0.0165 ns	0.0146 ns	0.92

Mod3	66	2.006 ns	0.0093 ns	0.0087 ns	1.00
FastMod3	66	1.801 ns	0.0180 ns	0.0150 ns	0.90

No change for x86, but save 2 `mov` instructions on x64.

gfoidl

Super 👍

Some nits. Otherwise LGTM.

source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncoder.Encode.cs

gfoidl · 2020-02-18T00:28:03Z

source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncoder.Encode.cs

@@ -231,8 +231,30 @@ private static int GetNumBase64PaddingCharsAddedByEncode(int dataLength)
            // 0 -> 0
            // 1 -> 2
            // 2 -> 1
+            uint mod3 = FastMod3((uint)dataLength);
+            return mod3 == 0 ? 0 : (int)(3 - mod3);


It would be nice if the JIT can emit cmov here, so it gets branchless. Unfortunately there is no "trick" to force the JIT to do so, except some very obscure hacks (which we shouldn't use here).

I am curious to know the obscure hacks...

Not exactely a cmov (which would be desired), but still a branchless version with some bit-twiddling and the knowledge that true / false is represented as 1 / 0 -> snippet in sharplab
It the stack isn't used, it would be even better.

The main advantage will be when the branch-predictor can't work reliable due to "random" inputs, i.e. unknown patterns for mod3 == 0. Then this approach may gain some perf.
Contrast this with clang, so it could be pretty concise branchless code.

Interesting: if the bool is "casted" to byte instead of int the stack-usage goes away and code looks pretty good.

Can you give this a try with the above benchmark? If it has potential to be faster, we should go with such a approach (although state otherwise above 😉).

Tried with the BranchlessSelect casted as byte. The results are interesting:

Method Value Mean Error StdDev Median Ratio RatioSD

Mod3 64 2.607 ns 0.0991 ns 0.2890 ns 2.519 ns 1.00 0.00

FastMod3 64 2.208 ns 0.0491 ns 0.1351 ns 2.165 ns 0.85 0.11

FastMod3_Branchless 64 2.611 ns 0.0502 ns 0.0445 ns 2.606 ns 1.07 0.06

FastMod3_BranchlessReversed 64 2.299 ns 0.0066 ns 0.0051 ns 2.300 ns 0.94 0.05

Mod3 65 2.357 ns 0.0494 ns 0.0462 ns 2.350 ns 1.00 0.00

FastMod3 65 2.108 ns 0.0411 ns 0.0365 ns 2.094 ns 0.89 0.02

FastMod3_Branchless 65 2.622 ns 0.0626 ns 0.0615 ns 2.607 ns 1.11 0.04

FastMod3_BranchlessReversed 65 2.313 ns 0.0467 ns 0.0437 ns 2.298 ns 0.98 0.02

Mod3 66 2.157 ns 0.0346 ns 0.0289 ns 2.166 ns 1.00 0.00

FastMod3 66 1.920 ns 0.0396 ns 0.0471 ns 1.918 ns 0.89 0.02

FastMod3_Branchless 66 2.637 ns 0.0588 ns 0.0577 ns 2.631 ns 1.23 0.04

FastMod3_BranchlessReversed 66 2.334 ns 0.0617 ns 0.0606 ns 2.327 ns 1.09 0.03

The 'branchless' is not as efficient as expected .
When reverting the branch BranchlessSelect(mod3 != 0, 3 - mod3, 0) instead of BranchlessSelect(mod3 == 0, 0, 3 - mod3), the result are better but still under the branched one.

Thanks for the numbers. 👍

Some thoughts -- but I don't know what matches here...

The effect of cmov et.al. depends on the distribution of taken / not taken branches, i.e. on how effecient the branch predictor can work. If the branch predictor does a good job, then the "branchy code" may be more efficient, as the pipeline can speculatively process the following steps, whereas with cmov the rest of the pipeline needs to wait for the result of cmov.

The code for the branchless version is quite a bit, so it may be more efficient just to use simple and compact code with a branch -- which can be predicted, as in the benchmarks where for constant input length the result will always be the same.
I'm pretty sure that there will be a benchmark constellation where the branchless version will be quite a bit faster.

In effect I think we should go with the code as is. It's easy and not hacky...and as the numbers show quite good. Furthermore I doubt a slight improvement here will show any effect on overall perf.

Once the runtime (JIT) will detect a pattern like return cond ? a : b and emit better code (maybe with tiering) we'll gain the perf-win for free.

More or less just for fun I tried a vectorized approach to make it branchless. I show it here because you are curios 😉

Benchmark is laid out so that (hopefully) the branch-predictor won't do a good job.

| Method | Mean | Error | StdDev | Ratio | RatioSD | |------------ |----------:|----------:|----------:|------:|--------:| | Default | 1.0953 ns | 0.0213 ns | 0.0199 ns | 1.00 | 0.00 | | Branchless | 0.8785 ns | 0.0136 ns | 0.0127 ns | 0.80 | 0.02 | | Vectorized | 0.9657 ns | 0.0210 ns | 0.0186 ns | 0.88 | 0.02 | | Vectorized1 | 0.8840 ns | 0.0156 ns | 0.0146 ns | 0.81 | 0.02 |

Benchmark-code

using System; using System.Runtime.CompilerServices; using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86; using BenchmarkDotNet.Attributes; #if !DEBUG using BenchmarkDotNet.Running; #endif namespace ConsoleApp4 { class Program { static void Main(string[] args) { var bench = new Bench(); int mod3 = 0; Console.WriteLine(bench.Default(mod3)); Console.WriteLine(bench.Branchless(mod3)); Console.WriteLine(bench.Vectorized(mod3)); Console.WriteLine(bench.Vectorized1(mod3)); Console.WriteLine(); mod3 = 1; Console.WriteLine(bench.Default(mod3)); Console.WriteLine(bench.Branchless(mod3)); Console.WriteLine(bench.Vectorized(mod3)); Console.WriteLine(bench.Vectorized1(mod3)); Console.WriteLine(); mod3 = 2; Console.WriteLine(bench.Default(mod3)); Console.WriteLine(bench.Branchless(mod3)); Console.WriteLine(bench.Vectorized(mod3)); Console.WriteLine(bench.Vectorized1(mod3)); #if !DEBUG BenchmarkRunner.Run<Bench>(); #endif } } public class Bench { [Benchmark(Baseline = true, OperationsPerInvoke = 3)] public int Default() { int s = 0; for (int i = 0; i < 3; ++i) { s += this.Default(i); } return s; } [Benchmark(OperationsPerInvoke = 3)] public int Branchless() { int s = 0; for (int i = 0; i < 3; ++i) { s += this.Branchless(i); } return s; } [Benchmark(OperationsPerInvoke = 3)] public int Vectorized() { int s = 0; for (int i = 0; i < 3; ++i) { s += this.Vectorized(i); } return s; } [Benchmark(OperationsPerInvoke = 3)] public int Vectorized1() { int s = 0; for (int i = 0; i < 3; ++i) { s += this.Vectorized1(i); } return s; } public int Default(int mod3) => mod3 == 0 ? 0 : 3 - mod3; public int Branchless(int mod3) => BranchlessSelect(mod3 != 0, 3 - mod3, 0); public int Vectorized(int mod3) { Vector128<int> vMod3 = Vector128.CreateScalarUnsafe(mod3); Vector128<int> zero = Vector128<int>.Zero; Vector128<int> valueIfTrue = Vector128.CreateScalarUnsafe(3 - mod3); Vector128<int> eqZero = Sse2.CompareEqual(vMod3, zero); Vector128<int> neqZero = Sse2.CompareEqual(eqZero, zero); Vector128<int> result = Sse2.And(neqZero, valueIfTrue); return Sse2.ConvertToInt32(result); } public int Vectorized1(int mod3) { Vector128<int> vMod3 = Vector128.CreateScalarUnsafe(mod3); Vector128<int> zero = Vector128<int>.Zero; Vector128<int> valueIfTrue = Vector128.CreateScalarUnsafe(3 - mod3); Vector128<int> eqZero = Sse2.CompareEqual(vMod3, zero); Vector128<int> result = Sse2.AndNot(eqZero, valueIfTrue); return Sse2.ConvertToInt32(result); } [MethodImpl(MethodImplOptions.AggressiveInlining)] private static int BranchlessSelect(bool condition, int valueIfTrue, int valueIfFalse) { return ((-Unsafe.As<bool, byte>(ref condition)) & (valueIfTrue - valueIfFalse)) + valueIfFalse; } } }

Codegen

Use `Environment.Is64BitProcess` instead of sizeof(IntPtr) == 8 in unsafe context.

…r.Encode.cs Co-Authored-By: Günther Foidl <gue@korporal.at>

Co-Authored-By: Günther Foidl <gue@korporal.at>

gfoidl · 2020-02-18T10:04:45Z

Looks like CI has an assert failure in this test

Base64/tests/gfoidl.Base64.Tests/Internal/Base64UrlEncoderTests/GetEncodedLength.cs

Line 75 in 9e4584b

    
           public void SourceLength_is_negative___throws_ArgumentOutOfRange(int sourceLength)

ycrumeyrolle · 2020-02-18T10:38:02Z

I will take a look at this issue

gfoidl · 2020-02-18T17:06:15Z

The assert failure is caused by a test, which hits

Base64/source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncoder.Encode.cs

Line 245 in 2e74e9b

Debug.Assert(value <= int.MaxValue);

The fix is quite simple...
Get base64EncodedLen first, so the exception is thrown before the assert is hit.
35ae9e0 (you can cherry pick or just copy, whatever is easier for you).

source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncoder.Encode.cs

…oon as possible

gfoidl

LGTM -- if you don't have anything else for this PR I'll merge it later today.

source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncoder.Encode.cs

…r.Encode.cs

source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncoder.Encode.cs

…r.Encode.cs

gfoidl · 2020-03-17T11:29:21Z

@ycrumeyrolle thank you!

Minor improvement on the dataLength % 3.

ffb75a0

No change for x86, but save 2 `mov` instructions on x64.

ycrumeyrolle requested a review from gfoidl as a code owner February 17, 2020 22:41

gfoidl reviewed Feb 18, 2020

View reviewed changes

gfoidl added this to the v1.1.0 milestone Feb 18, 2020

gfoidl added the perf label Feb 18, 2020

gfoidl assigned ycrumeyrolle Feb 18, 2020

ycrumeyrolle and others added 3 commits February 18, 2020 10:01

Update Base64UrlEncoder.Encode.cs

dd8ea7c

Use `Environment.Is64BitProcess` instead of sizeof(IntPtr) == 8 in unsafe context.

Update source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncode…

8b29e0a

…r.Encode.cs Co-Authored-By: Günther Foidl <gue@korporal.at>

Apply suggestions from code review

2e74e9b

Co-Authored-By: Günther Foidl <gue@korporal.at>

FIx test assertion failure

e376b94

ycrumeyrolle commented Feb 18, 2020

View reviewed changes

source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncoder.Encode.cs Show resolved Hide resolved

Inverse method call for throwing the ArgumentOutOfRangeException as s…

e566b68

…oon as possible

gfoidl approved these changes Feb 18, 2020

View reviewed changes

gfoidl reviewed Feb 18, 2020

View reviewed changes

source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncoder.Encode.cs Show resolved Hide resolved

gfoidl reviewed Mar 13, 2020

View reviewed changes

source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncoder.Encode.cs Show resolved Hide resolved

Update source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncode…

27f487c

…r.Encode.cs

gfoidl reviewed Mar 17, 2020

View reviewed changes

source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncoder.Encode.cs Outdated Show resolved Hide resolved

Update source/gfoidl.Base64/Internal/Base64UrlEncoder/Base64UrlEncode…

ae21697

…r.Encode.cs

gfoidl merged commit 63b2ab2 into gfoidl:master Mar 17, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Minor improvement on the `dataLength % 3`. #114

Minor improvement on the `dataLength % 3`. #114

ycrumeyrolle commented Feb 17, 2020

gfoidl left a comment

gfoidl Feb 18, 2020

ycrumeyrolle Feb 18, 2020

gfoidl Feb 18, 2020 •

edited

gfoidl Feb 18, 2020

ycrumeyrolle Feb 18, 2020

gfoidl Feb 18, 2020

gfoidl Feb 19, 2020

gfoidl commented Feb 18, 2020

ycrumeyrolle commented Feb 18, 2020

gfoidl commented Feb 18, 2020

gfoidl left a comment •

edited

gfoidl commented Mar 17, 2020

Method	Value	Mean	Error	StdDev	Median	Ratio	RatioSD
Mod3	64	2.607 ns	0.0991 ns	0.2890 ns	2.519 ns	1.00	0.00
FastMod3	64	2.208 ns	0.0491 ns	0.1351 ns	2.165 ns	0.85	0.11
FastMod3_Branchless	64	2.611 ns	0.0502 ns	0.0445 ns	2.606 ns	1.07	0.06
FastMod3_BranchlessReversed	64	2.299 ns	0.0066 ns	0.0051 ns	2.300 ns	0.94	0.05

Mod3	65	2.357 ns	0.0494 ns	0.0462 ns	2.350 ns	1.00	0.00
FastMod3	65	2.108 ns	0.0411 ns	0.0365 ns	2.094 ns	0.89	0.02
FastMod3_Branchless	65	2.622 ns	0.0626 ns	0.0615 ns	2.607 ns	1.11	0.04
FastMod3_BranchlessReversed	65	2.313 ns	0.0467 ns	0.0437 ns	2.298 ns	0.98	0.02

Mod3	66	2.157 ns	0.0346 ns	0.0289 ns	2.166 ns	1.00	0.00
FastMod3	66	1.920 ns	0.0396 ns	0.0471 ns	1.918 ns	0.89	0.02
FastMod3_Branchless	66	2.637 ns	0.0588 ns	0.0577 ns	2.631 ns	1.23	0.04
FastMod3_BranchlessReversed	66	2.334 ns	0.0617 ns	0.0606 ns	2.327 ns	1.09	0.03

Minor improvement on the dataLength % 3. #114

Minor improvement on the dataLength % 3. #114

Conversation

ycrumeyrolle commented Feb 17, 2020

Description

Benchmarks

gfoidl left a comment

Choose a reason for hiding this comment

gfoidl Feb 18, 2020

Choose a reason for hiding this comment

ycrumeyrolle Feb 18, 2020

Choose a reason for hiding this comment

gfoidl Feb 18, 2020 • edited

Choose a reason for hiding this comment

gfoidl Feb 18, 2020

Choose a reason for hiding this comment

ycrumeyrolle Feb 18, 2020

Choose a reason for hiding this comment

gfoidl Feb 18, 2020

Choose a reason for hiding this comment

gfoidl Feb 19, 2020

Choose a reason for hiding this comment

gfoidl commented Feb 18, 2020

ycrumeyrolle commented Feb 18, 2020

gfoidl commented Feb 18, 2020

gfoidl left a comment • edited

Choose a reason for hiding this comment

gfoidl commented Mar 17, 2020

Minor improvement on the `dataLength % 3`. #114

Minor improvement on the `dataLength % 3`. #114

gfoidl Feb 18, 2020 •

edited

gfoidl left a comment •

edited