This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Vectorize Convert.ToBase64String using SSSE3 #21833

Closed
Changes from 20 commits
25 commits
d218652
Vectorize Convert.ToBase64String
EgorBo Jan 6, 2019
8217470
Fallback to ConvertToBase64Array for corner cases
EgorBo Jan 6, 2019
d0d89ca
Only Base64FormattingOptions.None is supported so far
EgorBo Jan 6, 2019
1bf78f5
fix typo
EgorBo Jan 6, 2019
1c187ea
Clean up
EgorBo Jan 6, 2019
3fcdabf
Add initial SSSE3-based impl
EgorBo Jan 7, 2019
6a5e3af
Merge remote-tracking branch 'dotnet/master' into base64-vectorize
EgorBo Feb 3, 2019
ed74c5b
Merge SSSE3-based impl with ConvertToBase64Array
EgorBo Feb 9, 2019
02547ba
remove avx
EgorBo Feb 9, 2019
9b8c9d1
Add copy-right and use SSSE3 when inputLength >= 36
EgorBo Feb 9, 2019
2ab1c5e
move to a separate method, also move constant vectors
EgorBo Feb 9, 2019
8acc598
rename static readonly fields (add s_ prefix)
EgorBo Feb 10, 2019
df53ee9
remove Ssse3.IsSupported from static readonly vectors
EgorBo Feb 10, 2019
d9204b4
Merge remote-tracking branch 'dotnet/master' into base64-vectorize
EgorBo Feb 21, 2019
36eb502
copy static readonly vectors to local variables to keep them in regis…
EgorBo Feb 21, 2019
f29eab7
Merge remote-tracking branch 'dotnet/master' into base64-vectorize
EgorBo Apr 24, 2019
cd42c3c
Add more comments
EgorBo Apr 24, 2019
72ea550
update THIRD-PARTY-NOTICES.TXT
EgorBo Apr 24, 2019
aef8747
update comments
EgorBo Apr 24, 2019
a281f2a
Merge branch 'master' into base64-vectorize
danmoseley May 28, 2019
1f80164
Merge branch 'master' of github.com:EgorBo/coreclr into base64-vectorize
EgorBo Nov 4, 2019
3239269
Fix build error (StoreScalar)
EgorBo Nov 4, 2019
d772632
Merge branch 'master' of github.com:dotnet/coreclr into base64-vectorize
EgorBo Nov 4, 2019
77207a2
Update THIRD-PARTY-NOTICES.TXT
EgorBo Nov 4, 2019
55c7dac
formatting
EgorBo Nov 4, 2019
32 changes: 32 additions & 0 deletions THIRD-PARTY-NOTICES.TXT
@@ -281,3 +281,35 @@ LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

License notice for vectorized base64 encoding
--------------------------------------------------------

Copyright (c) 2005-2007, Nick Galbreath
Copyright (c) 2013-2017, Alfred Klomp
Copyright (c) 2015-2017, Wojciech Mula
Copyright (c) 2016-2017, Matthieu Darbois
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

* Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
139 changes: 134 additions & 5 deletions src/System.Private.CoreLib/shared/System/Convert.cs
@@ -1,4 +1,4 @@
// Licensed to the .NET Foundation under one or more agreements.
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

@@ -12,6 +12,8 @@
using System.Security;
using System.Diagnostics;
using System.Diagnostics.CodeAnalysis;
using System.Runtime.Intrinsics.X86;
using System.Runtime.Intrinsics;

namespace System
{
@@ -2531,21 +2533,148 @@ public static unsafe bool TryToBase64Chars(ReadOnlySpan<byte> bytes, Span<char>
charsWritten = ConvertToBase64Array(outChars, inData, 0, bytes.Length, insertLineBreaks);
return true;
}
}

internal static readonly Vector128<byte> s_base64ShuffleMask = Vector128.Create((byte)
Member

A short comment describing each constant would be useful.

It's also not clear why these are static readonly while several of the others (such as tt0-tt8) are not.

Member

Given https://github.com/dotnet/coreclr/issues/17225 and https://github.com/dotnet/coreclr/issues/26976, it would be more efficient, in both processing and space, to use the ReadOnlySpan<byte> (ROS<byte>) read-only property trick on these, especially since they're only used by code behind an Ssse3.IsSupported check.
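
The read-only property trick mentioned in this comment can be sketched roughly as follows. This is only an illustration, not code from the PR: the names `Base64Constants`, `ShuffleMaskBytes` and `LoadShuffleMask` are hypothetical. It relies on the C# compiler storing a constant `new byte[] { ... }` returned from a `ReadOnlySpan<byte>` property as static data in the assembly, so no array is allocated and no static constructor is needed.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical holder type; the PR itself keeps these constants as static readonly fields.
internal static class Base64Constants
{
    // Same byte values as s_base64ShuffleMask in the diff.
    // The compiler emits this constant array as read-only data and wraps it in a span.
    private static ReadOnlySpan<byte> ShuffleMaskBytes => new byte[]
    {
        1, 0, 2, 1, 4, 3, 5, 4, 7, 6, 8, 7, 10, 9, 11, 10
    };

    // Reads the 16 bytes into a vector; the data is not guaranteed to be 16-byte aligned,
    // so an unaligned read is used.
    internal static Vector128<byte> LoadShuffleMask() =>
        Unsafe.ReadUnaligned<Vector128<byte>>(ref MemoryMarshal.GetReference(ShuffleMaskBytes));
}
```

A caller would load the vector once before the loop, for example `Vector128<byte> localShuffleMask = Base64Constants.LoadShuffleMask();`, in place of copying from the static readonly field.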

1, 0, 2, 1, 4, 3, 5, 4, 7, 6, 8, 7, 10, 9, 11, 10);

internal static readonly Vector128<byte> s_base64ShiftLut = Vector128.Create(
(sbyte)'a' - 26, (sbyte)'0' - 52,
(sbyte)'0' - 52, (sbyte)'0' - 52,
(sbyte)'0' - 52, (sbyte)'0' - 52,
(sbyte)'0' - 52, (sbyte)'0' - 52,
(sbyte)'0' - 52, (sbyte)'0' - 52,
(sbyte)'0' - 52, (sbyte)'+' - 62,
(sbyte)'/' - 63, (sbyte)'A', 0, 0).AsByte();

internal static readonly Vector128<byte> s_base64TwoBytesStringMaskLo = Vector128.Create(
0, 0x80, 1, 0x80,
2, 0x80, 3, 0x80,
4, 0x80, 5, 0x80,
6, 0x80, 7, 0x80);

// Based on "Base64 encoding with SIMD instructions" article by Wojciech Muła http://0x80.pl/notesen/2016-01-12-sse-base64-encoding.html (see THIRD-PARTY-NOTICES.txt)
// The original code can be found here: https://github.com/WojciechMula/base64simd/blob/master/encode/encode.sse.cpp (and lookup_pshufb_improved as a lookup function)
private static unsafe (int i, int j, int charcount) ConvertToBase64ArraySsse3(char* outChars, byte* inData, int length, int offset, bool insertLineBreaks)
{
int i = offset, j = 0, charcount = 0;
const int stride = 4 * 3;

byte* outputBytes = (byte*)outChars;

Vector128<byte> tt0 = Vector128.Create(0x0fc0fc00).AsByte();
Vector128<ushort> tt1 = Vector128.Create(0x04000040).AsUInt16();
Vector128<byte> tt2 = Vector128.Create(0x003f03f0).AsByte();
Vector128<ushort> tt3 = Vector128.Create(0x01000010).AsUInt16();
Vector128<byte> tt5 = Vector128.Create((byte)51);
Vector128<sbyte> tt7 = Vector128.Create((sbyte)26);
Vector128<byte> tt8 = Vector128.Create((byte)13);

// static readonly Vector128 field + assigning its value to a local variable is a C# pattern for `const __mX`
Vector128<byte> localShiftLut = s_base64ShiftLut;
Vector128<byte> localShuffleMask = s_base64ShuffleMask;
Vector128<byte> localTwoBytesStringMaskLo = s_base64TwoBytesStringMaskLo;

for (; i <= length - stride; i += stride)
{
// input = [xxxx|DDDC|CCBB|BAAA]
Vector128<byte> inputVector = Sse2.LoadVector128(inData + i);

// bytes from groups A, B and C are needed in separate 32-bit lanes
// in = [DDDD|CCCC|BBBB|AAAA]
//
// an input triplet has layout
// [????????|ccdddddd|bbbbcccc|aaaaaabb]
// byte 3 byte 2 byte 1 byte 0 -- byte 3 comes from the next triplet
//
// shuffling changes the order of bytes: 1, 0, 2, 1
// [bbbbcccc|ccdddddd|aaaaaabb|bbbbcccc]
// ^^^^ ^^^^^^^^ ^^^^^^^^ ^^^^
// processed bits
inputVector = Ssse3.Shuffle(inputVector, localShuffleMask);

// unpacking

// t0 = [0000cccc|cc000000|aaaaaa00|00000000]
Vector128<byte> t0 = Sse2.And(inputVector, tt0);
// t1 = [00000000|00cccccc|00000000|00aaaaaa]
Vector128<byte> t1 = Sse2.MultiplyHigh(t0.AsUInt16(), tt1).AsByte();
// t2 = [00000000|00dddddd|000000bb|bbbb0000]
Vector128<byte> t2 = Sse2.And(inputVector, tt2);
// t3 = [00dddddd|00000000|00bbbbbb|00000000]
Vector128<byte> t3 = Sse2.MultiplyLow(t2.AsUInt16(), tt3).AsByte();
// indices = [00dddddd|00cccccc|00bbbbbb|00aaaaaa] = t1 | t3
Member

nit: the paper looks to differentiate between BB and bbbb as well as CC and cccc

Member Author

I am not sure I follow, the comments are 100% copied from https://github.com/WojciechMula/base64simd/blob/master/encode/encode.sse.cpp#L20-L59 🙂

Member

> I am not sure I follow, the comments are 100% copied from

Aren't we then missing the appropriate 3rd-party notice information, copying the relevant licensing information into this file, etc.?
cc: @richlander

Member Author

updated the THIRD-PARTY-NOTICES.txt

Member

> I am not sure I follow, the comments are 100% copied from

I was commenting that the paper, in some places, differentiates between uppercase BB and lowercase bbbb. The source code doesn't seem to have maintained that in all the various places.

For example, the paper calls out:

> Input to this step are 32-bit words, each having following layout:
>
> [bbbbcccc|CCdddddd|aaaaaaBB|bbbbcccc]
>
> where bits aaaaaa, BBbbbb, ccccCC and dddddd are 6-bit indices. The output of this step has to be:
>
> [00dddddd|00ccccCC|00BBbbbb|00aaaaaa]

This looks to be particularly important for tracking which bits flow where

Member

It would still be nice to see this addressed.
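
To make the bit flow in this thread easier to follow, here is a scalar sketch of what one 32-bit lane goes through, using the paper's upper/lowercase notation. It is an illustration under assumptions, not the PR's code: the class `Base64LaneSketch` and method `UnpackLane` are hypothetical, and the 16-bit multiplies of `Sse2.MultiplyHigh` / `Sse2.MultiplyLow` are emulated with ordinary integer arithmetic on each half of the lane. The mask and multiplier constants are the ones from `tt0` to `tt3` in the diff.

```csharp
// Hypothetical scalar model of one 32-bit lane of the SSSE3 unpacking step.
internal static class Base64LaneSketch
{
    // b0 = aaaaaaBB, b1 = bbbbcccc, b2 = CCdddddd (the three input bytes of one triplet).
    internal static uint UnpackLane(byte b0, byte b1, byte b2)
    {
        // After the byte shuffle (1, 0, 2, 1) the lane reads, most significant byte first:
        // [bbbbcccc|CCdddddd|aaaaaaBB|bbbbcccc]
        uint lane = ((uint)b1 << 24) | ((uint)b2 << 16) | ((uint)b0 << 8) | (uint)b1;

        // t0 = [0000cccc|CC000000|aaaaaa00|00000000]
        uint t0 = lane & 0x0fc0fc00u;
        // MultiplyHigh per 16-bit half: multiply, then keep the upper 16 bits.
        uint t1Lo = ((t0 & 0xFFFFu) * 0x0040u) >> 16;   // 00aaaaaa
        uint t1Hi = ((t0 >> 16) * 0x0400u) >> 16;       // 00ccccCC
        // t1 = [00000000|00ccccCC|00000000|00aaaaaa]
        uint t1 = (t1Hi << 16) | t1Lo;

        // t2 = [00000000|00dddddd|000000BB|bbbb0000]
        uint t2 = lane & 0x003f03f0u;
        // MultiplyLow per 16-bit half: multiply, then keep the lower 16 bits.
        uint t3Lo = ((t2 & 0xFFFFu) * 0x0010u) & 0xFFFFu;   // 00BBbbbb 00000000
        uint t3Hi = ((t2 >> 16) * 0x0100u) & 0xFFFFu;       // 00dddddd 00000000
        // t3 = [00dddddd|00000000|00BBbbbb|00000000]
        uint t3 = (t3Hi << 16) | t3Lo;

        // indices = [00dddddd|00ccccCC|00BBbbbb|00aaaaaa]
        return t1 | t3;
    }
}
```

For the input bytes of "Man" (0x4D, 0x61, 0x6E) this yields 0x2E051613, i.e. the indices 19, 22, 5 and 46 from the lowest byte upward, which correspond to "TWFu" in the standard base64 alphabet.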

Vector128<byte> indices = Sse2.Or(t1, t3);

// lookup function "Single pshufb method" (lookup_pshufb_improved)
Vector128<byte> result = Sse2.SubtractSaturate(indices, tt5);
Member

Any reason this isn't a static local function (since it was a separate function in the original algorithm)? Inlining?
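
Whether the lookup stays inline or moves into a static local function, the logic it performs can be modelled per element as below. This is a scalar sketch under assumptions, not the PR's code: `Base64LookupSketch` and `Lookup` are hypothetical names, and the table holds the same offsets as `s_base64ShiftLut` in the diff, stored as `int` for simplicity.

```csharp
using System;

// Hypothetical scalar model of the "single pshufb" lookup (lookup_pshufb_improved).
internal static class Base64LookupSketch
{
    // Offsets added to a 6-bit index to reach its ASCII character.
    // Slot 0: 'a'..'z', slots 1..10: '0'..'9', slot 11: '+', slot 12: '/', slot 13: 'A'..'Z'.
    private static readonly int[] s_shiftLut =
    {
        'a' - 26, '0' - 52, '0' - 52, '0' - 52, '0' - 52, '0' - 52,
        '0' - 52, '0' - 52, '0' - 52, '0' - 52, '0' - 52, '+' - 62,
        '/' - 63, 'A', 0, 0
    };

    internal static char Lookup(int index)
    {
        // Saturating subtract of 51: 0 for indices 0..51, 1..12 for indices 52..63.
        int slot = Math.Max(index - 51, 0);

        // Indices below 26 are redirected to slot 13 (the compare-and-mask step with 26 and 13).
        if (index < 26)
        {
            slot |= 13;
        }

        // The shuffle picks the offset; the final add produces the character.
        return (char)(index + s_shiftLut[slot]);
    }
}
```

Calling `Base64LookupSketch.Lookup` for every index from 0 to 63 reproduces the standard base64 alphabet, which is a convenient sanity check for the table values.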

Vector128<sbyte> compareResult = Sse2.CompareGreaterThan(tt7, indices.AsSByte());
result = Sse2.Or(result, Sse2.And(compareResult.AsByte(), tt8));
result = Ssse3.Shuffle(localShiftLut, result);
result = Sse2.Add(result, indices);
// end of lookup function

// save as two-bytes string, e.g.:
// 1,2,3,4,5..16 => 1,0,2,0,3,0..16,0
Sse2.Store(outputBytes + j, Ssse3.Shuffle(result, localTwoBytesStringMaskLo));
j += Vector128<byte>.Count;

// Do it for the second part of the vector (rotate it first in order to re-use localTwoBytesStringMaskLo)
result = Sse2.Shuffle(result.AsUInt32(), 0x4E /*_MM_SHUFFLE(1,0,3,2)*/).AsByte();
result = Ssse3.Shuffle(result, localTwoBytesStringMaskLo);

if (insertLineBreaks && (charcount += 16) >= base64LineBreakPosition)
Member

I would move the case with insertLineBreaks into a separate method, so that the codegen for either case can be optimized.

This may also prevent some spills in the simd-registers (if there are any).

Member Author
EgorBo Feb 12, 2019

I didn't observe any noticeable performance regression, for any input size, after adding this block when insertLineBreaks is false

Member

Having a side effect that only runs when insertLineBreaks is true, yet is required for both the true and false scenarios, is non-obvious.

It would be nice to move the charCount += 16 out separately
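
One way to read this suggestion is sketched below as a small bookkeeping model of the loop, with the counter update separated from the condition. It is an assumption-laden illustration, not the PR's code: `LineBreakSketch` and `Simulate` are hypothetical, no encoding is performed, and the constant 76 mirrors `base64LineBreakPosition`. The per-iteration numbers (12 input bytes, 16 output chars, a shortened iteration that emits 12 chars plus `\r\n` and rewinds 3 bytes) follow the comments in the diff.

```csharp
// Hypothetical model that only tracks the counters of the SSSE3 loop.
internal static class LineBreakSketch
{
    internal static (int bytesConsumed, int charsWritten, int lineBreaks) Simulate(
        int iterations, bool insertLineBreaks)
    {
        const int stride = 12;              // input bytes per iteration (4 * 3)
        const int lineBreakPosition = 76;   // mirrors base64LineBreakPosition

        int bytesConsumed = 0, charsWritten = 0, lineBreaks = 0, charcount = 0;

        for (int n = 0; n < iterations; n++)
        {
            bytesConsumed += stride;

            // The update is a plain statement instead of a side effect inside the condition.
            charcount += 16;

            if (insertLineBreaks && charcount >= lineBreakPosition)
            {
                // Only 8 + 4 chars fit on the current line, followed by "\r\n".
                charcount = 0;
                charsWritten += 12 + 2;
                lineBreaks++;
                // The 4 chars that were dropped correspond to 3 input bytes, which are redone.
                bytesConsumed -= stride / 4;
            }
            else
            {
                charsWritten += 16;
            }
        }

        return (bytesConsumed, charsWritten, lineBreaks);
    }
}
```

With line breaks enabled, five iterations consume 57 bytes and write 78 chars (76 base64 chars plus `\r\n`); with line breaks disabled, the same five iterations consume 60 bytes and write 80 chars.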

{
// Normally we save 32 bytes per iteration
// but `insertLineBreaks` needs `\r\n` (4 bytes) between each 76*2=152 bytes. 152/32 = 4.75 (i.e. not a multiple of 32)
// we need to insert `\r\n` in the middle of Vector128<byte> somehow
// but the following code just saves a half of the vector, then appends `\r\n` manually
// and the second part of the vector is ignored (this is why 'i' is decremented)
charcount = 0;
var shuffleResult = result.AsUInt64();
Sse2.StoreLow((ulong*)(outputBytes + j), shuffleResult);
j += Vector128<byte>.Count / 2;
outputBytes[j++] = (byte)'\r';
outputBytes[j++] = 0;
outputBytes[j++] = (byte)'\n';
outputBytes[j++] = 0;
i -= stride / 4;
}
else
{
Sse2.Store(outputBytes + j, result);
j += Vector128<byte>.Count;
}
}
// The SIMD-based algorithm used `j` to count bytes; the software fallback uses it to count chars
j /= 2;

return (i, j, charcount);
}

private static unsafe int ConvertToBase64Array(char* outChars, byte* inData, int offset, int length, bool insertLineBreaks)
{
int charcount = 0;
int i = offset;
int j = 0;

if (Ssse3.IsSupported && length - offset >= 36)
{
// Returning a tuple is faster than passing i, j, charcount by ref.
// The SSSE3 impl is moved to a separate method in order to avoid a regression for smaller inputs
(i, j, charcount) = ConvertToBase64ArraySsse3(outChars, inData, length, offset, insertLineBreaks);
if (i == length)
return j;
}

int lengthmod3 = length % 3;
int calcLength = offset + (length - lengthmod3);
int j = 0;
int charcount = 0;
//Convert three bytes at a time to base64 notation. This will consume 4 chars.
int i;

// get a pointer to the base64Table to avoid unnecessary range checking
fixed (char* base64 = &base64Table[0])
{
for (i = offset; i < calcLength; i += 3)
for (; i < calcLength; i += 3)
{
if (insertLineBreaks)
{