Vectorize HexConverter.EncodeToUtf16 using SSSE3 #44111

Merged
merged 31 commits into from
Jan 24, 2021

EgorBo
Member

@EgorBo EgorBo commented Oct 31, 2020

Based on my dotnet/aspnetcore#18406 (comment) (and improved by @benaadams)

Self-contained benchmark: https://gist.github.com/EgorBo/b0da2cd604f713a7767df907d5a3dfa6

| Method |     input |       Mean |     Error |    StdDev |
|------- |---------- |-----------:|----------:|----------:|
|    BCL |   Byte[1] |   5.798 ns | 0.0166 ns | 0.0155 ns |
|  SSSE3 |   Byte[1] |   6.330 ns | 0.0097 ns | 0.0086 ns |

|    BCL |   Byte[2] |   7.359 ns | 0.0103 ns | 0.0092 ns |
|  SSSE3 |   Byte[2] |   8.752 ns | 0.0119 ns | 0.0106 ns |

|    BCL |   Byte[3] |   9.570 ns | 0.0992 ns | 0.0879 ns |
|  SSSE3 |   Byte[3] |   9.702 ns | 0.0693 ns | 0.0649 ns |

|    BCL |   Byte[4] |  10.743 ns | 0.0339 ns | 0.0317 ns |
|  SSSE3 |   Byte[4] |   9.963 ns | 0.0341 ns | 0.0319 ns |

|    BCL |   Byte[5] |  12.754 ns | 0.0328 ns | 0.0291 ns |
|  SSSE3 |   Byte[5] |  12.358 ns | 0.0345 ns | 0.0322 ns |

|    BCL |   Byte[6] |  14.445 ns | 0.0251 ns | 0.0210 ns |
|  SSSE3 |   Byte[6] |  14.029 ns | 0.0034 ns | 0.0030 ns |

|    BCL |   Byte[7] |  16.286 ns | 0.0342 ns | 0.0320 ns |
|  SSSE3 |   Byte[7] |  15.236 ns | 0.0131 ns | 0.0116 ns |

|    BCL |   Byte[8] |  17.490 ns | 0.0677 ns | 0.0633 ns |
|  SSSE3 |   Byte[8] |  11.907 ns | 0.0072 ns | 0.0064 ns |

|    BCL |   Byte[9] |  19.622 ns | 0.0416 ns | 0.0389 ns |
|  SSSE3 |   Byte[9] |  14.095 ns | 0.0476 ns | 0.0445 ns |

|    BCL |  Byte[10] |  21.176 ns | 0.0521 ns | 0.0488 ns |
|  SSSE3 |  Byte[10] |  15.259 ns | 0.0149 ns | 0.0140 ns |

|    BCL |  Byte[22] |  40.949 ns | 0.0787 ns | 0.0657 ns |
|  SSSE3 |  Byte[22] |  19.479 ns | 0.0531 ns | 0.0497 ns |

|    BCL |  Byte[32] |  57.793 ns | 0.2507 ns | 0.2222 ns |
|  SSSE3 |  Byte[32] |  20.652 ns | 0.0449 ns | 0.0375 ns |

|    BCL |  Byte[64] | 112.013 ns | 0.4231 ns | 0.3751 ns |
|  SSSE3 |  Byte[64] |  31.294 ns | 0.1032 ns | 0.0915 ns |

|    BCL | Byte[366] | 619.513 ns | 0.8064 ns | 0.7543 ns |
|  SSSE3 | Byte[366] | 143.214 ns | 0.3288 ns | 0.2915 ns |

Some of the directly affected APIs:

  • System.Net.Http.AuthenticationHelper.ComputeHash(string data, string algorithm)
  • System.Convert.ToHexString(ReadOnlySpan<byte> bytes) -- new API
  • System.Convert.ToHexString(byte[] inArray) -- new API
  • Some crypto-related APIs

Will add more test-cases to ConvertToHexStringTests
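
For context, the per-byte work that the vectorized path speeds up can be sketched in scalar form. This is a minimal illustration, not the code in HexConverter.cs; the names `HexSketch` and `EncodeScalar` are hypothetical. It compares its output against `Convert.ToHexString`, the new public API in .NET 5 and later.

```csharp
using System;

static class HexSketch
{
    // Hypothetical helper: writes two UTF-16 hex chars (upper case) per input byte.
    public static string EncodeScalar(ReadOnlySpan<byte> bytes)
    {
        const string table = "0123456789ABCDEF";
        char[] chars = new char[bytes.Length * 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            chars[2 * i] = table[bytes[i] >> 4];       // high nibble
            chars[2 * i + 1] = table[bytes[i] & 0xF];  // low nibble
        }
        return new string(chars);
    }

    static void Main()
    {
        byte[] data = { 0xDE, 0xAD, 0xBE, 0xEF };
        Console.WriteLine(EncodeScalar(data));         // DEADBEEF
        Console.WriteLine(Convert.ToHexString(data));  // DEADBEEF
    }
}
```

The scalar loop does one table lookup per nibble, so its cost grows linearly with the input length, which matches the BCL rows in the table above.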

@Dotnet-GitSync-Bot
Collaborator

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

@GrabYourPitchforks
Member

FWIW, I believe the common use cases for this API are when the input is <= 32 bytes, and occasionally 64 bytes. The number of use cases past that point is vanishingly small. Egor's table doesn't include entries for 32 or 64 bytes, but interpolating the data suggests SSSE3-intrinsicified should see 3x throughput compared to baseline.

@GrabYourPitchforks added the enhancement (Product code improvement that does NOT require public API changes/additions) and tenet-performance (Performance related issue) labels Oct 31, 2020
@EgorBo
Member Author

EgorBo commented Nov 5, 2020

Does it look good?

Member

@GrabYourPitchforks GrabYourPitchforks left a comment

LGTM! Left some really low-pri comments, feel free to ignore as you see fit. Thanks for driving this. :)

@EgorBo
Member Author

EgorBo commented Nov 26, 2020

@stephentoub @jkotas can this be merged?

@jkotas
Member

jkotas commented Nov 26, 2020

Some of the directly affected APIs

What are the numbers for these public APIs?


// The high bytes (0x00) of the chars have also been converted
// to ascii hex '0', so clear them out.
hex = Sse2.And(hex, Vector128.Create((ushort)0xFF).AsByte());
Member

Idea for micro-optimization.

Vector128.Create creates the constant, which needs to be read from memory. With the other "constants" used here, it's likely they will be brought together into L1D, so the cost should be low.

But it still evicts a cacheline from L1D's set, so one could avoid this load by using bit-shifting for the masking. Something like

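// (pseudo code style; the shifts operate on 16-bit lanes, so the high byte of each char is cleared)
tmp = Sse2.ShiftLeftLogical(vec.AsUInt16(), 8);
vec = Sse2.ShiftRightLogical(tmp, 8).AsByte();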

Latency and throughput for the shifts are good, but this introduces a register dependency.

In micro-benchmarks the L1D will be hot, so it won't harm here, but in real usages that's likely not the case, so avoiding the load may be a plus.

TBH I don't know if it's worth it and so far it's more theory...
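
The equivalence behind this suggestion can be checked with a small sketch. It compares the `And` mask from the PR with a shift-left/shift-right pair applied to 16-bit lanes. The class and method names are hypothetical, and this says nothing about which variant is faster; it only shows that both produce the same vector.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class MaskSketch
{
    // Clears the high byte of each 16-bit lane with a constant mask
    // (the constant is typically materialized from memory).
    public static Vector128<byte> ClearWithAnd(Vector128<byte> v) =>
        Sse2.And(v, Vector128.Create((ushort)0xFF).AsByte());

    // Same result via two shifts on 16-bit lanes; no constant is needed.
    public static Vector128<byte> ClearWithShifts(Vector128<byte> v) =>
        Sse2.ShiftRightLogical(Sse2.ShiftLeftLogical(v.AsUInt16(), 8), 8).AsByte();

    static void Main()
    {
        if (!Sse2.IsSupported)
        {
            Console.WriteLine("SSE2 is not supported on this machine");
            return;
        }

        // Every byte is '7' (0x37), so each 16-bit lane is 0x3737 before masking.
        Vector128<byte> v = Vector128.Create((byte)0x37);
        Console.WriteLine(ClearWithAnd(v).Equals(ClearWithShifts(v))); // True: both yield 0x0037 per lane
    }
}
```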

Member Author

Nice idea! However, I don't see any improvement in the benchmarks (it is even slightly slower).

Sse2.And(hex, Vector128.Create((ushort)0xFF))

is emitted as vpand xmm0, xmm0, xmmword ptr [reloc @RWD00] (memory load without an additional register)
It probably makes sense to hoist the constants out of the loop too, but that also (almost) doesn't affect the benchmarks.

Member

doesn't affect the benchmarks

L1D is hot, so I expected this.
For real-world usage I doubt that benchmarks will show a measurable difference either.
But according to the theory, a different cache line could stay in its L1D set because it is not evicted, which could be good overall.

To show this in a benchmark one would need the address of @RWD00 so that the cache set is known (with CPU data from CPU-Z, etc.). Then one would load from that address and from other data that "fights" for the same cache set, so that the set is evicted constantly. But this is almost impossible to do reliably in a benchmark.
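
As a rough sketch of what "fighting for the same cache set" means, the set index can be derived from an address once the cache geometry is known. The geometry below (32 KiB L1D, 8-way, 64-byte lines, hence 64 sets) and the address are assumptions for illustration only; real CPUs differ, and the names `CacheSetSketch` and `SetIndex` are hypothetical.

```csharp
using System;

static class CacheSetSketch
{
    // Assumed geometry: 32 KiB / (8 ways * 64-byte lines) = 64 sets.
    public static int SetIndex(ulong address, int lineSize = 64, int sets = 64) =>
        (int)((address / (ulong)lineSize) % (ulong)sets);

    static void Main()
    {
        ulong a = 0x7FFE_1000_0040;   // hypothetical address of the constant
        ulong b = a + 64UL * 64UL;    // lineSize * sets bytes further on: different line, same set

        Console.WriteLine(SetIndex(a));                 // 1
        Console.WriteLine(SetIndex(a) == SetIndex(b));  // True
    }
}
```

With this assumed geometry, more than eight distinct lines that map to the same index cannot all stay resident in that set, which is the eviction pressure described above.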

@EgorBo
Member Author

EgorBo commented Nov 26, 2020

Some of the directly affected APIs

What are the numbers for these public APIs?

The main goal was to optimize the new APIs; I couldn't find any noticeable difference for the others.
AuthenticationHelper.ComputeHash is actually private, but it shouldn't regress since it always works with 32/64 bytes (SHA256/SHA512, or 16 for MD5 if it's used there).
https://gist.github.com/EgorBo/cb08048d3fc2d49a12921a859e184219

|                   Method |     array |        Mean | Branch |
|------------------------- |---------- |------------:|------- |

|            Convert_ToHex |   Byte[4] |    16.36 ns | master |
|            Convert_ToHex |   Byte[4] |    16.28 ns | PR     |

|            Convert_ToHex |   Byte[8] |    23.18 ns | master |
|            Convert_ToHex |   Byte[8] |    18.01 ns | PR     |

|            Convert_ToHex |  Byte[10] |    26.63 ns | master |
|            Convert_ToHex |  Byte[10] |    21.15 ns | PR     |

|            Convert_ToHex |  Byte[20] |    45.23 ns | master |
|            Convert_ToHex |  Byte[20] |    22.53 ns | PR     |

|            Convert_ToHex |  Byte[32] |    64.74 ns | master |
|            Convert_ToHex |  Byte[32] |    26.28 ns | PR     |

|            Convert_ToHex |  Byte[64] |   120.19 ns | master |
|            Convert_ToHex |  Byte[64] |    38.17 ns | PR     |

|            Convert_ToHex | Byte[512] |   881.49 ns | master |
|            Convert_ToHex | Byte[512] |   223.56 ns | PR     |

| X509Certificate_ToString |           | 1,008.64 ns | master |
| X509Certificate_ToString |           | 1,006.13 ns | PR     |

@GrabYourPitchforks
Member

@jeffhandley did you mean to assign this to me?

@EgorBo are we waiting for anything else before merge?

@EgorBo
Member Author

EgorBo commented Jan 24, 2021

@jeffhandley @GrabYourPitchforks it's finished from my end

@jeffhandley
Member

Were you just waiting for another signoff then, @EgorBo?

@EgorBo
Member Author

EgorBo commented Jan 24, 2021

Were you just waiting for another signoff then, @EgorBo?

Ah, I never merged non-mono related PRs 🙂, can I merge it now then?

Member

@jeffhandley jeffhandley left a comment

Ah, cool. Yeah, go for it, @EgorBo.

@EgorBo EgorBo merged commit 2f1def8 into dotnet:master Jan 24, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Feb 27, 2021
Labels
area-System.Runtime, enhancement (Product code improvement that does NOT require public API changes/additions), tenet-performance (Performance related issue)