Vectorize Convert.ToBase64String using SSSE3 #21833

Open
wants to merge 25 commits into
base: master
from

@EgorBo
Contributor

EgorBo commented Jan 6, 2019

This PR improves Convert.ToBase64String performance using SSSE3 instructions.
It's based on the article "Base64 encoding with SIMD instructions" by Wojciech Muła.

Benchmark:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Collections.Generic;

namespace ConsoleApp143
{
    public class ToBase64StringBenchmarks
    {
        public static IEnumerable<object[]> TestDataForGraph()
        {
            var rand = new Random(314666); // fixed "seed"
            for (int i = 0; i < 100; i++)
            {
                var data = new byte[i];
                for (int j = 0; j < i; j++)
                    data[j] = (byte)rand.Next(0, byte.MaxValue);
                yield return new object[] { data, i };
            }
        }

        [Benchmark]
        [ArgumentsSource(nameof(TestDataForGraph))]
        public string ToBase64(byte[] testData, int inputSize /* argument for report */) =>
            Convert.ToBase64String(testData, Base64FormattingOptions.InsertLineBreaks);

        static unsafe void Main(string[] args) => 
            BenchmarkSwitcher.FromAssembly(typeof(ToBase64StringBenchmarks).Assembly).Run(args);
    }
}

Windows 10.0.17134.523, Core i7-8700K 3.7GHz (Coffee Lake):

[image: benchmark graph]

macOS 10.13.6, Core i7-4980HQ 2.8GHz (Haswell):

[image: benchmark graph]

The SSSE3-based implementation is only used when input.Length > 36, in order to avoid regressions for smaller inputs (36 was the best cutoff on my Skylake, Coffee Lake and Haswell based machines).
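
A minimal sketch of the gating described above. Only `Ssse3.IsSupported` and the 36-byte threshold come from the description; the class, constant and method names are hypothetical and the PR's actual dispatch code is not reproduced here.

```csharp
using System.Runtime.Intrinsics.X86;

public static class Base64DispatchSketch
{
    // Threshold quoted in the PR description; the constant name is hypothetical.
    private const int VectorizationThreshold = 36;

    // True only when the CPU supports SSSE3 and the input is long enough
    // for the vectorized loop to pay off; otherwise the scalar path is used.
    public static bool ShouldUseVectorPath(int inputLength) =>
        Ssse3.IsSupported && inputLength > VectorizationThreshold;
}
```

Inputs of 36 bytes or fewer always take the scalar path in this sketch, regardless of hardware support.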

EgorBo added 5 commits Jan 6, 2019
@stephentoub

Member

stephentoub commented Jan 7, 2019

For smaller input arrays according to my benchmark, performance shows up after input.Length >= 50

The graph doesn't show below 24... is there a regression for small values? (It's pretty common to use base-64 encoding with small values, such as in various HTTP headers.)

@tannergooding

Member

tannergooding commented Jan 7, 2019

BTW, when I did port I had to manually reverse all values in _mm256_setr - maybe it makes sense to add Vector.CreateReversed in order to simplify such cases?

The current Create methods for Vector64, Vector128, and Vector256 take the values in the same order as the native setr methods (which is e0, e1, ...).
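
A small check of the element order described above: the first argument passed to `Vector128.Create` ends up in element 0, which is the `setr` convention. The class and method names are hypothetical and the values are arbitrary.

```csharp
using System.Runtime.Intrinsics;

public static class CreateOrderSketch
{
    // Builds a vector from 16 bytes listed in e0, e1, ... order and
    // returns the element stored at the requested index.
    public static byte ElementAt(int index)
    {
        Vector128<byte> v = Vector128.Create(
            (byte)10, 11, 12, 13, 14, 15, 16, 17,
            18, 19, 20, 21, 22, 23, 24, 25);
        return v.GetElement(index);
    }
}
```

Constants written for `_mm_set_epi8` (highest element first) therefore have to be listed in reverse when passed to `Create`.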

@gfoidl

Contributor

gfoidl commented Jan 7, 2019

FYI: dotnet/corefx#32365 (will do when I get some time for this) (and https://github.com/gfoidl/Base64)

@EgorBo

Contributor Author

EgorBo commented Jan 7, 2019

@gfoidl oh, didn't see your work. I did this just to practice and test Intrinsics API 🙂

@EgorBo

Contributor Author

EgorBo commented Feb 10, 2019

@tannergooding @stephentoub @fiigii @gfoidl I updated the PR and its description (added graphs). Could you please take a look?
I tried to keep it simple and small, and to avoid any regressions for small values.

EgorBo changed the title from "Vectorize Convert.ToBase64String using AVX2" to "Vectorize Convert.ToBase64String using SSSE3" on Feb 10, 2019
Vector128<byte> t2 = Sse2.And(inputVector, tt2);
// t3 = [00dddddd|00000000|00bbbbbb|00000000]
Vector128<byte> t3 = Sse2.MultiplyLow(t2.AsUInt16(), tt3).AsByte();
// indices = [00dddddd|00cccccc|00bbbbbb|00aaaaaa] = t1 | t3

@tannergooding

tannergooding Mar 27, 2019

Member

nit: the paper looks to differentiate between BB and bbbb as well as CC and cccc

@EgorBo

EgorBo Apr 24, 2019

Author Contributor

I am not sure I follow, the comments are 100% copied from https://github.com/WojciechMula/base64simd/blob/master/encode/encode.sse.cpp#L20-L59 🙂

@stephentoub

stephentoub Apr 24, 2019

Member

I am not sure I follow, the comments are 100% copied from

Aren't we then missing the appropriate 3rd-party notice information, copying the relevant licensing information into this file, etc.?
cc: @richlander

@EgorBo

EgorBo Apr 24, 2019

Author Contributor

updated the THIRD-PARTY-NOTICES.txt

@tannergooding

tannergooding May 26, 2019

Member

I am not sure I follow, the comments are 100% copied from

I was commenting that the paper, in some places, differentiates between uppercase BB and lowercase bbbb. The source code doesn't seem to have maintained that in all the various places.

For example, the paper calls out:

Input to this step are 32-bit words, each having following layout:

[bbbbcccc|CCdddddd|aaaaaaBB|bbbbcccc]

where bits aaaaaa, BBbbbb, ccccCC and dddddd are 6-bit indices. The output of this step has to be:

[00dddddd|00ccccCC|00BBbbbb|00aaaaaa]

This looks to be particularly important for tracking which bits flow where.
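
To make the bit flow concrete, here is a scalar sketch that emulates the unpacking step on one 32-bit word. The mask and multiplier constants are the ones the referenced article's SSE encoder uses, as far as I recall; treating them as the PR's `tt0`..`tt3` is an assumption, and the class and helper names are hypothetical.

```csharp
public static class Base64IndexSketch
{
    // Emulates the unpacking step on one shuffled word
    // [bbbbcccc|CCdddddd|aaaaaaBB|bbbbcccc] built from three input bytes.
    // Returns [00dddddd|00ccccCC|00BBbbbb|00aaaaaa].
    public static uint UnpackIndices(byte in0, byte in1, byte in2)
    {
        // Byte order after the shuffle is in1, in0, in2, in1 (lowest first).
        uint word = ((uint)in1 << 24) | ((uint)in2 << 16) | ((uint)in0 << 8) | (uint)in1;

        uint t0 = word & 0x0FC0FC00u;        // [0000cccc|CC000000|aaaaaa00|00000000]
        uint t1 = MulHi16(t0, 0x04000040u);  // [00000000|00ccccCC|00000000|00aaaaaa]
        uint t2 = word & 0x003F03F0u;        // [00000000|00dddddd|000000BB|bbbb0000]
        uint t3 = MulLo16(t2, 0x01000010u);  // [00dddddd|00000000|00BBbbbb|00000000]
        return t1 | t3;
    }

    // Per-16-bit-lane unsigned multiply keeping the high half (like pmulhuw).
    private static uint MulHi16(uint x, uint y)
    {
        uint lo = ((x & 0xFFFFu) * (y & 0xFFFFu)) >> 16;
        uint hi = ((x >> 16) * (y >> 16)) >> 16;
        return (hi << 16) | lo;
    }

    // Per-16-bit-lane multiply keeping the low half (like pmullw).
    private static uint MulLo16(uint x, uint y)
    {
        uint lo = ((x & 0xFFFFu) * (y & 0xFFFFu)) & 0xFFFFu;
        uint hi = ((x >> 16) * (y >> 16)) & 0xFFFFu;
        return (hi << 16) | lo;
    }
}
```

For the bytes of "Man" (0x4D, 0x61, 0x6E) this yields the indices 19, 22, 5 and 46 from the lowest byte up, which correspond to "TWFu".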

@tannergooding

tannergooding Nov 4, 2019

Member

It would still be nice to see this addressed.

Member

tannergooding left a comment

Overall, LGTM. Just a couple of questions/nits

@stephentoub

Member

stephentoub commented Apr 23, 2019

@EgorBo, are you still working on this?

@EgorBo

Contributor Author

EgorBo commented Apr 24, 2019

@stephentoub updated the comments.
I guess this PR intersects with dotnet/corefx#34529 by @gfoidl, who started working on this earlier (my PR focuses only on encoding).

@gfoidl

Contributor

gfoidl commented Apr 24, 2019

@EgorBo I wouldn't call it "intersects", as the other PR is for span-based byte -> byte encoding / decoding, whilst this one is for byte -> string (with line-breaks). So similar, but different targets.

If there were no need for line breaks, the base64 encoding in Convert could be based on System.Buffers.Text.Base64.
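
As an illustration of that idea, here is a minimal sketch that produces a base64 string through `System.Buffers.Text.Base64` when no line breaks are needed. The class and method names are hypothetical, and this is not how `Convert` is implemented.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;
using System.Text;

public static class Utf8Base64Sketch
{
    // Encodes bytes to base64 as UTF-8, then widens the ASCII output to a string.
    public static string Encode(byte[] data)
    {
        byte[] utf8 = new byte[Base64.GetMaxEncodedToUtf8Length(data.Length)];
        OperationStatus status = Base64.EncodeToUtf8(data, utf8, out _, out int written);
        if (status != OperationStatus.Done)
            throw new InvalidOperationException($"Unexpected status: {status}");
        return Encoding.ASCII.GetString(utf8, 0, written);
    }
}
```

The extra byte-to-char widening step is one reason the two code paths are similar but have different targets.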

EgorBo and others added 2 commits Apr 24, 2019
@danmosemsft

Member

danmosemsft commented May 28, 2019

Resolved merge conflict so we can get test results.

@danmosemsft

Member

danmosemsft commented May 28, 2019

@tannergooding if tests pass is this ready to merge?

@tannergooding

Member

tannergooding commented May 28, 2019

I'll give this one more pass after lunch.

@sandreenko

Member

sandreenko commented Nov 2, 2019

@EgorBo do you think this PR can be finished before the consolidation (in the next 2 weeks)?

EgorBo added 5 commits Nov 4, 2019
# Conflicts:
#	THIRD-PARTY-NOTICES.TXT
#	src/System.Private.CoreLib/shared/System/Convert.cs
sandreenko requested a review from tannergooding Nov 4, 2019
@@ -2492,19 +2494,146 @@ public static unsafe bool TryToBase64Chars(ReadOnlySpan<byte> bytes, Span<char>
}
}

internal static readonly Vector128<byte> s_base64ShuffleMask = Vector128.Create((byte)

@tannergooding

tannergooding Nov 4, 2019

Member

A short comment describing each constant would be useful.

It's also not clear why these are static readonly, but several of the others (such as tt0-tt8) are not.

@saucecontrol

saucecontrol Nov 4, 2019

Member

Given #17225 and #26976, it would be more efficient, both in processing and in space, to use the ReadOnlySpan<byte> read-only property trick on these, especially since they're only used by code behind a Ssse3.IsSupported check.
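
A sketch of what that trick could look like for one constant. The names are hypothetical; the mask bytes are the article's shuffle order (1, 0, 2, 1 per triplet) written lowest element first, which is an assumption about what `s_base64ShuffleMask` holds.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class ConstantDataSketch
{
    // A ReadOnlySpan<byte> property returning a constant byte array lets the
    // compiler point at static data in the assembly instead of allocating an
    // array or running a static constructor.
    private static ReadOnlySpan<byte> ShuffleMaskBytes => new byte[]
    {
        1, 0, 2, 1, 4, 3, 5, 4, 7, 6, 8, 7, 10, 9, 11, 10
    };

    // Reads the 16 constant bytes as a vector at the point of use.
    public static Vector128<byte> LoadShuffleMask() =>
        Unsafe.ReadUnaligned<Vector128<byte>>(
            ref MemoryMarshal.GetReference(ShuffleMaskBytes));
}
```

Because the data is only touched when `LoadShuffleMask` is called, machines without SSSE3 would never pay for initializing it.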

Vector128<byte> indices = Sse2.Or(t1, t3);

// lookup function "Single pshufb method" (lookup_pshufb_improved)
Vector128<byte> result = Sse2.SubtractSaturate(indices, tt5);

@tannergooding

tannergooding Nov 4, 2019

Member

Any reason this isn't a static local function (since it was a separate function in the original algorithm)? Inlining?
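
For readers unfamiliar with the lookup step under discussion, here is a scalar sketch of the "single pshufb" mapping from a 6-bit index to its ASCII character, following the article's lookup_pshufb_improved as I understand it. It assumes the constant subtracted first (`tt5` in the snippet) is 51; the class and method names are hypothetical.

```csharp
using System;

public static class Base64LookupSketch
{
    // Offsets that turn a 6-bit index into its ASCII code, selected by a
    // reduced 4-bit key (the role the pshufb table plays in the vector code).
    private static readonly int[] ShiftLut =
    {
        'a' - 26,                                               // key 0: indices 26..51
        '0' - 52, '0' - 52, '0' - 52, '0' - 52, '0' - 52,
        '0' - 52, '0' - 52, '0' - 52, '0' - 52, '0' - 52,       // keys 1..10: indices 52..61
        '+' - 62,                                               // key 11: index 62
        '/' - 63,                                               // key 12: index 63
        'A',                                                    // key 13: indices 0..25
        0, 0                                                    // unused
    };

    // Maps one index in the range 0..63 to its base64 character.
    public static char Lookup(int index)
    {
        int key = Math.Max(index - 51, 0);  // saturating subtract: 0..51 -> 0, 52..63 -> 1..12
        if (index < 26) key |= 13;          // separate 0..25 from 26..51
        return (char)(index + ShiftLut[key]);
    }
}
```

In the vectorized version all 16 indices go through these steps at once, so keeping the lookup as a separate function is mostly a readability question.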

result = Sse2.Shuffle(result.AsUInt32(), 0x4E /*_MM_SHUFFLE(1,0,3,2)*/).AsByte();
result = Ssse3.Shuffle(result, localTwoBytesStringMaskLo);

if (insertLineBreaks && (charcount += 16) >= base64LineBreakPosition)

@tannergooding

tannergooding Nov 4, 2019

Member

Having the side effect only happen when insertLineBreaks is true, while it is required in both the true and false scenarios, is non-obvious.

It would be nice to move the charCount += 16 out separately.
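
A minimal sketch of the difference being pointed out, using a hypothetical counter loop rather than the PR's actual code. Because `&&` short-circuits, the fused form never advances the counter when `insertLineBreaks` is false.

```csharp
public static class LineBreakCounterSketch
{
    // Line length used when inserting line breaks; the constant name is hypothetical.
    private const int LineBreakPosition = 76;

    // Fused form: the increment is hidden inside the condition.
    public static int Fused(bool insertLineBreaks, int iterations)
    {
        int charCount = 0;
        for (int i = 0; i < iterations; i++)
        {
            // The += only runs when insertLineBreaks is true.
            if (insertLineBreaks && (charCount += 16) >= LineBreakPosition)
            {
                // Line-break handling would go here.
            }
        }
        return charCount;
    }

    // Separated form: the counter advances on every iteration.
    public static int Separated(bool insertLineBreaks, int iterations)
    {
        int charCount = 0;
        for (int i = 0; i < iterations; i++)
        {
            charCount += 16;
            if (insertLineBreaks && charCount >= LineBreakPosition)
            {
                // Line-break handling would go here.
            }
        }
        return charCount;
    }
}
```

With `insertLineBreaks` set to false, `Fused` leaves the counter at zero while `Separated` advances it by 16 per iteration.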

@saucecontrol

Member

saucecontrol commented Nov 4, 2019

The SSSE3-based implementation is only used when input.Length > 36, in order to avoid regressions for smaller inputs (36 was the best cutoff on my Skylake, Coffee Lake and Haswell based machines).

Is the 36-byte cutover point appropriate for 32-bit as well? There are more than 8 active XMM registers used in the inner loop, so there will likely be some stack shuffling offsetting the SSE gains.

@maryamariyan

Member

maryamariyan commented Nov 6, 2019

Thank you for your contribution. As announced in #27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:

  1. In your coreclr repository clone, create a patch by running git format-patch origin
  2. In your runtime repository clone, apply the patch by running git apply --directory src/coreclr <path to the patch created in step 1>