Vectorize HexConverter.EncodeToUtf16 using SSSE3 #44111

Merged

EgorBo merged 31 commits into dotnet:master from EgorBo:vectorize-hex-converter

Jan 24, 2021

Member

EgorBo commented Oct 31, 2020 (edited)

Based on my dotnet/aspnetcore#18406 (comment) (and improved by @benaadams)

Self-contained benchmark: https://gist.github.com/EgorBo/b0da2cd604f713a7767df907d5a3dfa6

| Method |     input |       Mean |     Error |    StdDev |
|------- |---------- |-----------:|----------:|----------:|
|    BCL |   Byte[1] |   5.798 ns | 0.0166 ns | 0.0155 ns |
|  SSSE3 |   Byte[1] |   6.330 ns | 0.0097 ns | 0.0086 ns |
|    BCL |   Byte[2] |   7.359 ns | 0.0103 ns | 0.0092 ns |
|  SSSE3 |   Byte[2] |   8.752 ns | 0.0119 ns | 0.0106 ns |
|    BCL |   Byte[3] |   9.570 ns | 0.0992 ns | 0.0879 ns |
|  SSSE3 |   Byte[3] |   9.702 ns | 0.0693 ns | 0.0649 ns |
|    BCL |   Byte[4] |  10.743 ns | 0.0339 ns | 0.0317 ns |
|  SSSE3 |   Byte[4] |   9.963 ns | 0.0341 ns | 0.0319 ns |
|    BCL |   Byte[5] |  12.754 ns | 0.0328 ns | 0.0291 ns |
|  SSSE3 |   Byte[5] |  12.358 ns | 0.0345 ns | 0.0322 ns |
|    BCL |   Byte[6] |  14.445 ns | 0.0251 ns | 0.0210 ns |
|  SSSE3 |   Byte[6] |  14.029 ns | 0.0034 ns | 0.0030 ns |
|    BCL |   Byte[7] |  16.286 ns | 0.0342 ns | 0.0320 ns |
|  SSSE3 |   Byte[7] |  15.236 ns | 0.0131 ns | 0.0116 ns |
|    BCL |   Byte[8] |  17.490 ns | 0.0677 ns | 0.0633 ns |
|  SSSE3 |   Byte[8] |  11.907 ns | 0.0072 ns | 0.0064 ns |
|    BCL |   Byte[9] |  19.622 ns | 0.0416 ns | 0.0389 ns |
|  SSSE3 |   Byte[9] |  14.095 ns | 0.0476 ns | 0.0445 ns |
|    BCL |  Byte[10] |  21.176 ns | 0.0521 ns | 0.0488 ns |
|  SSSE3 |  Byte[10] |  15.259 ns | 0.0149 ns | 0.0140 ns |
|    BCL |  Byte[22] |  40.949 ns | 0.0787 ns | 0.0657 ns |
|  SSSE3 |  Byte[22] |  19.479 ns | 0.0531 ns | 0.0497 ns |
|    BCL |  Byte[32] |  57.793 ns | 0.2507 ns | 0.2222 ns |
|  SSSE3 |  Byte[32] |  20.652 ns | 0.0449 ns | 0.0375 ns |
|    BCL |  Byte[64] | 112.013 ns | 0.4231 ns | 0.3751 ns |
|  SSSE3 |  Byte[64] |  31.294 ns | 0.1032 ns | 0.0915 ns |
|    BCL | Byte[366] | 619.513 ns | 0.8064 ns | 0.7543 ns |
|  SSSE3 | Byte[366] | 143.214 ns | 0.3288 ns | 0.2915 ns |

Some of the directly affected APIs:

- System.Net.Http.AuthenticationHelper.ComputeHash(string data, string algorithm)
- System.Convert.ToHexString(ReadOnlySpan<byte> bytes) -- new API
- System.Convert.ToHexString(byte[] inArray) -- new API
- Some crypto-related APIs

Will add more test-cases to ConvertToHexStringTests
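
For readers unfamiliar with the new API named in the list above, here is a minimal usage sketch. It assumes .NET 5 or later, where `System.Convert.ToHexString` is available. The `HexDemo` class and the sample bytes are illustrative and do not come from the PR.

```csharp
using System;

public static class HexDemo
{
    // byte[] overload; the result uses uppercase hex digits.
    public static string FromArray(byte[] data) => Convert.ToHexString(data);

    // ReadOnlySpan<byte> overload; useful for encoding a slice without copying it first.
    public static string FromSpan(ReadOnlySpan<byte> data) => Convert.ToHexString(data);

    public static void Main()
    {
        byte[] sample = { 0xDE, 0xAD, 0xBE, 0xEF };
        Console.WriteLine(FromArray(sample));             // DEADBEEF
        Console.WriteLine(FromSpan(sample.AsSpan(1, 2))); // ADBE
    }
}
```

Each input byte becomes two UTF-16 chars, which is why the vectorized path in this PR produces 8 chars (one 128-bit vector) per 4 input bytes.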

EgorBo added 3 commits

October 31, 2020 22:14


          Vectorize HexConverter.EncodeToUtf16 using SSSE3

6db1e03


          Clean up

ea1f6e1


          Fix typos

d6aefad

Collaborator

Dotnet-GitSync-Bot commented Oct 31, 2020

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

EgorBo added the area-System.Runtime label

EgorBo added 7 commits

October 31, 2020 22:56


          Fix compilation issue

96322dd


          Clean up

b67c0d3


          Add usings

24fbb3c


          Extract into a separate method

4d7e386


          Fix compilation issues

288de9c


          Clean up

577a4e4


          fix build issue

3aa5edb

Member

GrabYourPitchforks commented Oct 31, 2020

FWIW, I believe the common use cases for this API are when the input is <= 32 bytes, and occasionally 64 bytes. The number of use cases past that point is vanishingly small. Egor's table doesn't include entries for 32 or 64 bytes, but interpolating the data suggests the SSSE3-intrinsicified version should see 3x the throughput of the baseline.

GrabYourPitchforks reviewed

src/libraries/Common/src/System/HexConverter.cs (3 review threads, outdated and resolved)

GrabYourPitchforks added enhancement tenet-performance labels

EgorBo added 3 commits

November 1, 2020 00:57


          another attempt to fix CI

006e9d4


          Address feedback

1962bc6


          Remove default value for casing arg in EncodeToUtf16_Ssse3

50828d7

EgorBo mentioned this pull request

Revise how constant SIMD vectors are defined in BCL #44115

Closed

EgorBo added 3 commits

November 1, 2020 04:52


          fix ifdefs

d9d0af8


          fix ifdefs

8f61c1b


          fix ifdefs

996c164

am11 reviewed

src/libraries/Common/src/System/HexConverter.cs (1 review thread, resolved)

am11 reviewed

src/libraries/Common/src/System/HexConverter.cs (1 review thread, outdated and resolved)

EgorBo and others added 5 commits

November 1, 2020 13:09


          Update src/libraries/Common/src/System/HexConverter.cs

8a8baf6

Co-authored-by: Adeel Mujahid <3840695+am11@users.noreply.github.com>


          Update HexConverter.cs

79f352f


          Update HexConverter.cs

3bbb843


          Add a test

fdd5d6e


          Update Convert.ToHexString.cs

587f9ab

runfoapp bot mentioned this pull request

Inability to unzip assets during build on Unix x64 #32805

Closed

gfoidl reviewed

src/libraries/Common/src/System/HexConverter.cs (4 review threads, outdated and resolved)

runfoapp bot mentioned this pull request

System.IO.Compression.GzipStreamUnitTests.Write_DataReadFromDesiredOffset failed #44173

Closed

EgorBo added 5 commits

November 3, 2020 03:08


          Address feedback


          Address feedback

ee65beb


          use nint

7ff83bd


          Merge branch 'master' of github.com:dotnet/runtime into vectorize-hex-converter

c99e9a4


          Update HexConverter.cs

adefb21

Member Author

EgorBo commented Nov 5, 2020

Does it look good?

GrabYourPitchforks approved these changes

Member

GrabYourPitchforks left a comment

LGTM! Left some really low-pri comments, feel free to ignore as you see fit. Thanks for driving this. :)

src/libraries/Common/src/System/HexConverter.cs (2 review threads, outdated and resolved)

EgorBo added 2 commits

November 10, 2020 01:59


          Merge branch 'master' of github.com:dotnet/runtime into vectorize-hex-converter

f2785e5


          remove redundant int cast

fb2b13e

Member Author

EgorBo commented Nov 26, 2020

@stephentoub @jkotas can this be merged?

jkotas reviewed

src/libraries/Common/src/System/HexConverter.cs (1 review thread, outdated and resolved)

Member

jkotas commented Nov 26, 2020

> Some of the directly affected APIs

What are the numbers for these public APIs?


          Merge branch 'master' of github.com:dotnet/runtime into vectorize-hex-converter

a9ceecc

gfoidl reviewed

src/libraries/Common/src/System/HexConverter.cs (outdated)

+                          // The high bytes (0x00) of the chars have also been converted
+                          // to ascii hex '0', so clear them out.
+                          hex = Sse2.And(hex, Vector128.Create((ushort)0xFF).AsByte());
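
For context, the quoted lines are the last step of a nibble-lookup encoder. The following is a minimal sketch of that technique, not the code under review: it assumes uppercase output only and a little-endian x86 host. `HexSketch`, `Encode`, `spread`, and `table` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class HexSketch
{
    // Hypothetical helper: uppercase hex, 4 input bytes (8 output chars) per SSSE3 iteration.
    public static string Encode(ReadOnlySpan<byte> bytes)
    {
        char[] chars = new char[bytes.Length * 2];
        Span<byte> dest = MemoryMarshal.AsBytes(chars.AsSpan());
        int pos = 0;

        if (Ssse3.IsSupported)
        {
            // pshufb control: move input byte i to output offset 4*i + 2; 0xFF zeroes the lane.
            Vector128<byte> spread = Vector128.Create(
                (byte)0xFF, (byte)0xFF, (byte)0, (byte)0xFF, (byte)0xFF, (byte)0xFF, (byte)1, (byte)0xFF,
                (byte)0xFF, (byte)0xFF, (byte)2, (byte)0xFF, (byte)0xFF, (byte)0xFF, (byte)3, (byte)0xFF);
            // 16-entry lookup table: nibble value -> ASCII hex digit.
            Vector128<byte> table = Vector128.Create(
                (byte)'0', (byte)'1', (byte)'2', (byte)'3', (byte)'4', (byte)'5', (byte)'6', (byte)'7',
                (byte)'8', (byte)'9', (byte)'A', (byte)'B', (byte)'C', (byte)'D', (byte)'E', (byte)'F');

            for (; pos + 4 <= bytes.Length; pos += 4)
            {
                uint block = MemoryMarshal.Read<uint>(bytes.Slice(pos));

                // Input byte i sits at offset 4*i + 2 (low byte of the second char).
                Vector128<byte> low = Ssse3.Shuffle(Vector128.CreateScalar(block).AsByte(), spread);
                // Move it to offset 4*i and shift right by 4 bits to get the high nibble there.
                Vector128<byte> high = Sse2.ShiftRightLogical(
                    Sse2.ShiftRightLogical128BitLane(low, 2).AsUInt32(), 4).AsByte();

                // Every byte is now a table index in 0..15; the unused lanes are 0.
                Vector128<byte> indices = Sse2.And(Sse2.Or(low, high), Vector128.Create((byte)0x0F));
                Vector128<byte> hex = Ssse3.Shuffle(table, indices);

                // The unused lanes looked up index 0 and became '0'; clear the high byte of each char.
                hex = Sse2.And(hex, Vector128.Create((ushort)0xFF).AsByte());

                Unsafe.WriteUnaligned(ref dest[pos * 4], hex);
            }
        }

        // Scalar path for the tail, or for everything when SSSE3 is unavailable.
        const string Digits = "0123456789ABCDEF";
        for (; pos < bytes.Length; pos++)
        {
            chars[pos * 2] = Digits[bytes[pos] >> 4];
            chars[pos * 2 + 1] = Digits[bytes[pos] & 0xF];
        }
        return new string(chars);
    }
}
```

The final `Sse2.And` is needed because the table lookup runs over all 16 bytes of the vector, including the bytes that will become the upper half of each UTF-16 char.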

Member

gfoidl Nov 26, 2020

Idea for micro-optimization.

Vector128.Create creates a constant that needs to be read from memory. With the other "constants" used here, it's likely they will be brought into L1D together, so the cost should be low.

But it still evicts a cache line from one of L1D's sets, so one could avoid this load by using bit-shifting for the masking. Something like:

// (pseudo-code style; the shifts operate on the 16-bit char lanes)
tmp = Sse2.ShiftLeftLogical(hex.AsUInt16(), 8);
hex = Sse2.ShiftRightLogical(tmp, 8).AsByte();

Latency and throughput for the shifts are good, but this introduces a register dependency.

In micro-benchmarks the L1D will be hot, so it won't hurt there, but in real usage that's likely not the case, so avoiding the load may be a plus.

TBH I don't know if it's worth it and so far it's more theory...
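
To make the two alternatives concrete, here is a small sketch that checks they produce the same vector. It illustrates the idea only and is not code from the PR. `HighByteClear`, `WithMask`, and `WithShifts` are hypothetical names.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class HighByteClear
{
    // AND with a constant mask; the constant is typically loaded from memory.
    public static Vector128<ushort> WithMask(Vector128<ushort> v) =>
        Sse2.And(v, Vector128.Create((ushort)0x00FF));

    // Shift each 16-bit lane left, then right, by 8 bits; no constant is needed,
    // but the second shift depends on the result of the first.
    public static Vector128<ushort> WithShifts(Vector128<ushort> v) =>
        Sse2.ShiftRightLogical(Sse2.ShiftLeftLogical(v, 8), 8);

    public static void Main()
    {
        if (!Sse2.IsSupported)
        {
            Console.WriteLine("SSE2 is not available on this machine.");
            return;
        }

        // Each lane holds '0' (0x30) in the high byte and 'A' (0x41) in the low byte.
        Vector128<ushort> v = Vector128.Create((ushort)0x3041);
        Console.WriteLine(WithMask(v).Equals(WithShifts(v))); // True
    }
}
```

Both variants leave only the low byte of every lane, so the choice between them is purely about instruction and cache behavior, not about the result.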

Member Author

EgorBo Nov 26, 2020

Nice idea! However, I don't see any difference in benchmarks (it is even slightly slower).

Sse2.And(hex, Vector128.Create((ushort)0xFF))

is emitted as vpand xmm0, xmm0, xmmword ptr [reloc @RWD00] (a memory load without an additional register).
It probably makes sense to hoist the constants out of the loop too, but that also has almost no effect on the benchmarks.

Member

gfoidl Nov 26, 2020

> doesn't affect the benchmarks

L1D is hot, so I expected this.
For real-world usage I doubt the benchmarks will show a measurable difference either.
But according to the theory, a different cache line could be kept in L1D's sets because it is not evicted, which could be good overall.

To show this in a benchmark one would need the address of @RWD00 so the cache set is known (with CPU data from CPU-Z, etc.). Then load from that address and from other data that "fights" for the same cache set, so this set is evicted constantly. But this is almost impossible to do reliably in a benchmark.
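
As a rough illustration of what "knowing the cache set" means, the sketch below computes a set index from an address. The geometry (32 KiB, 8-way, 64-byte lines) is an assumption chosen as a common L1D configuration, not a figure from this thread, and the address is made up. `CacheSetSketch` and `SetIndex` are hypothetical names.

```csharp
using System;

public static class CacheSetSketch
{
    // Assumed geometry (not from the PR): 32 KiB L1D, 8-way set associative, 64-byte lines.
    public const int CacheBytes = 32 * 1024;
    public const int Ways = 8;
    public const int LineBytes = 64;
    public const int Sets = CacheBytes / (Ways * LineBytes); // 64 sets under these assumptions

    // Drop the offset-within-line bits, then wrap around by the number of sets.
    public static int SetIndex(ulong address) => (int)((address / LineBytes) % Sets);

    public static void Main()
    {
        ulong constantAddress = 0x7FF612345678UL;                        // made-up address of the constant
        ulong rivalAddress = constantAddress + (ulong)(Sets * LineBytes); // 4096 bytes further on

        // Addresses a multiple of Sets * LineBytes apart compete for the same set.
        Console.WriteLine(SetIndex(constantAddress) == SetIndex(rivalAddress)); // True
    }
}
```

Under these assumptions, more than eight hot lines mapping to the same index would be needed to force evictions from that set, which hints at why such a benchmark is hard to build reliably.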

Member Author

EgorBo commented Nov 26, 2020 (edited)

> Some of the directly affected APIs
>
> What are the numbers for these public APIs?

The main goal was to optimize the new APIs; I couldn't find any noticeable difference for the others.
AuthenticationHelper.ComputeHash is actually private, but it shouldn't regress since it always works with 32/64 bytes (SHA256/SHA512, or 16 for MD5 if it's used there).
https://gist.github.com/EgorBo/cb08048d3fc2d49a12921a859e184219

|                   Method |     array |        Mean | Branch |
|------------------------- |---------- |------------:|------- |
|            Convert_ToHex |   Byte[4] |    16.36 ns | master |
|            Convert_ToHex |   Byte[4] |    16.28 ns | PR     |
|            Convert_ToHex |   Byte[8] |    23.18 ns | master |
|            Convert_ToHex |   Byte[8] |    18.01 ns | PR     |
|            Convert_ToHex |  Byte[10] |    26.63 ns | master |
|            Convert_ToHex |  Byte[10] |    21.15 ns | PR     |
|            Convert_ToHex |  Byte[20] |    45.23 ns | master |
|            Convert_ToHex |  Byte[20] |    22.53 ns | PR     |
|            Convert_ToHex |  Byte[32] |    64.74 ns | master |
|            Convert_ToHex |  Byte[32] |    26.28 ns | PR     |
|            Convert_ToHex |  Byte[64] |   120.19 ns | master |
|            Convert_ToHex |  Byte[64] |    38.17 ns | PR     |
|            Convert_ToHex | Byte[512] |   881.49 ns | master |
|            Convert_ToHex | Byte[512] |   223.56 ns | PR     |
| X509Certificate_ToString |           | 1,008.64 ns | master |
| X509Certificate_ToString |           | 1,006.13 ns | PR     |


          Fix indent

d7fb72c

GrabYourPitchforks mentioned this pull request

Add vectorized implementation of hex encoding/decoding #39702

Closed

jeffhandley assigned GrabYourPitchforks

Member

GrabYourPitchforks commented Jan 23, 2021

@jeffhandley did you mean to assign this to me?

@EgorBo are we waiting for anything else before merge?

jeffhandley assigned EgorBo and unassigned GrabYourPitchforks

Member Author

EgorBo commented Jan 24, 2021

@jeffhandley @GrabYourPitchforks it's finished from my end

Member

jeffhandley commented Jan 24, 2021

Were you just waiting for another signoff then, @EgorBo?

Member Author

EgorBo commented Jan 24, 2021

> Were you just waiting for another signoff then, @EgorBo?

Ah, I have never merged non-Mono-related PRs 🙂. Can I merge it now then?

jeffhandley approved these changes

Member

jeffhandley left a comment

Ah, cool. Yeah, go for it, @EgorBo.

EgorBo merged commit 2f1def8 into dotnet:master

ghost locked as resolved and limited conversation to collaborators


Reviewers

stephentoub left review comments

am11 left review comments

gfoidl left review comments

jkotas left review comments

jeffhandley approved these changes

GrabYourPitchforks approved these changes

Labels

area-System.Runtime enhancement tenet-performance