Don't mutate strings in Kestrel #17556

Merged on Feb 10, 2020 (7 commits)

gfoidl (Member) commented Dec 3, 2019

Description

Addresses #8978

  • Replaced the mutating of strings with string.Create. To get cheap invocations of the SpanActions, they are cached in static readonly fields, as this is noticeably faster than relying on the caching code generated by the C# compiler (it saves the null check, among some other boilerplate code).

  • In TryGetAsciiString, added specific code paths for AVX2 and SSE2, as the codegen is much tighter than with the Vector<T> method. Further made use of Bmi2.ParallelBitDeposit to gain some perf in the non-vectorized path.
    Also changed the algorithm to exit early on non-ASCII values, instead of iterating to the end and then returning true/false.
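The early-exit idea in the second bullet can be illustrated with a minimal scalar sketch. This is not the code from the PR: it uses safe spans instead of pointers and no hardware intrinsics, and the names AsciiSketch, AllBytesAsciiNonNull and TryWidenAscii are hypothetical. It also assumes that 0x00 is rejected along with bytes above 0x7F, going by the "NonNullCharacters" naming used elsewhere in this PR.

```csharp
using System;
using System.Buffers.Binary;

public static class AsciiSketch
{
    // Hypothetical helper: true when all 8 bytes packed in `value` are in 0x01..0x7F.
    public static bool AllBytesAsciiNonNull(ulong value)
    {
        const ulong HighBits = 0x8080808080808080UL;
        const ulong Ones = 0x0101010101010101UL;
        // Subtracting 1 from every byte sets the high bit of a zero byte;
        // OR-ing with `value` keeps the high bit of any byte >= 0x80.
        return ((unchecked(value - Ones) | value) & HighBits) == 0;
    }

    // Widens ASCII bytes to chars and returns false at the first invalid chunk
    // instead of scanning the rest of the input.
    public static bool TryWidenAscii(ReadOnlySpan<byte> input, Span<char> output)
    {
        if (output.Length < input.Length)
        {
            return false;
        }

        int i = 0;
        while (input.Length - i >= sizeof(ulong))
        {
            ulong chunk = BinaryPrimitives.ReadUInt64LittleEndian(input.Slice(i));
            if (!AllBytesAsciiNonNull(chunk))
            {
                return false; // exit early
            }

            for (int j = 0; j < sizeof(ulong); j++)
            {
                output[i + j] = (char)input[i + j];
            }
            i += sizeof(ulong);
        }

        // Remainder: fewer than 8 bytes left, checked one at a time.
        for (; i < input.Length; i++)
        {
            byte b = input[i];
            if (b == 0 || b > 0x7F)
            {
                return false;
            }
            output[i] = (char)b;
        }

        return true;
    }
}
```

On a false return the contents of output are unspecified, so a caller would have to discard them.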

Benchmarks

Code for benchmarks

BytesToStringBenchmark

Before

|             Method |      Type |      Mean |    Error |    StdDev | Ratio | RatioSD |
|------------------- |---------- |----------:|---------:|----------:|------:|--------:|
| AsciiBytesToString | KeepAlive |  18.93 ns | 0.373 ns |  0.430 ns |  1.00 |    0.00 |
|  Utf8BytesToString | KeepAlive |  78.53 ns | 1.293 ns |  1.080 ns |  4.11 |    0.11 |
|                    |           |           |          |           |       |         |
| AsciiBytesToString |    Accept |  39.00 ns | 0.614 ns |  0.545 ns |  1.00 |    0.00 |
|  Utf8BytesToString |    Accept | 184.56 ns | 2.830 ns |  2.508 ns |  4.73 |    0.11 |
|                    |           |           |          |           |       |         |
| AsciiBytesToString | UserAgent |  41.99 ns | 0.837 ns |  1.889 ns |  1.00 |    0.00 |
|  Utf8BytesToString | UserAgent | 216.20 ns | 2.758 ns |  2.445 ns |  5.08 |    0.24 |
|                    |           |           |          |           |       |         |
| AsciiBytesToString |    Cookie |  71.71 ns | 1.384 ns |  1.594 ns |  1.00 |    0.00 |
|  Utf8BytesToString |    Cookie | 474.78 ns | 9.311 ns | 16.307 ns |  6.72 |    0.27 |

After

|             Method |      Type |      Mean |     Error |    StdDev | Ratio | RatioSD |
|------------------- |---------- |----------:|----------:|----------:|------:|--------:|
| AsciiBytesToString | KeepAlive |  18.22 ns |  0.335 ns |  0.297 ns |  1.00 |    0.00 |
|  Utf8BytesToString | KeepAlive |  76.59 ns |  1.158 ns |  1.027 ns |  4.21 |    0.09 |
|                    |           |           |           |           |       |         |
| AsciiBytesToString |    Accept |  35.56 ns |  0.668 ns |  0.743 ns |  1.00 |    0.00 |
|  Utf8BytesToString |    Accept | 165.42 ns |  1.290 ns |  1.144 ns |  4.64 |    0.12 |
|                    |           |           |           |           |       |         |
| AsciiBytesToString | UserAgent |  35.98 ns |  0.647 ns |  0.573 ns |  1.00 |    0.00 |
|  Utf8BytesToString | UserAgent | 200.06 ns |  3.965 ns |  5.686 ns |  5.52 |    0.14 |
|                    |           |           |           |           |       |         |
| AsciiBytesToString |    Cookie |  63.78 ns |  1.575 ns |  4.618 ns |  1.00 |    0.00 |
|  Utf8BytesToString |    Cookie | 454.94 ns | 10.547 ns | 17.909 ns |  6.82 |    0.56 |

TryGetAsciiStringBenchmarks

ThisPR_ExitEarly is the method finally used in this PR.

|           Method | BytesLen |      Mean |     Error |    StdDev | Ratio | RatioSD |
|----------------- |--------- |----------:|----------:|----------:|------:|--------:|
|          Default |        4 |  5.924 ns | 0.1577 ns | 0.2311 ns |  1.00 |    0.00 |
|           ThisPR |        4 |  6.853 ns | 0.1514 ns | 0.1416 ns |  1.14 |    0.05 |
| ThisPR_ExitEarly |        4 |  5.963 ns | 0.0912 ns | 0.0853 ns |  0.99 |    0.03 |
|                  |          |           |           |           |       |         |
|          Default |        8 |  6.913 ns | 0.0548 ns | 0.0512 ns |  1.00 |    0.00 |
|           ThisPR |        8 |  7.557 ns | 0.1806 ns | 0.2008 ns |  1.09 |    0.03 |
| ThisPR_ExitEarly |        8 |  6.055 ns | 0.0645 ns | 0.0603 ns |  0.88 |    0.01 |
|                  |          |           |           |           |       |         |
|          Default |       16 |  9.547 ns | 0.2118 ns | 0.1768 ns |  1.00 |    0.00 |
|           ThisPR |       16 |  5.080 ns | 0.1395 ns | 0.1492 ns |  0.53 |    0.02 |
| ThisPR_ExitEarly |       16 |  4.330 ns | 0.0464 ns | 0.0412 ns |  0.45 |    0.01 |
|                  |          |           |           |           |       |         |
|          Default |       32 |  7.693 ns | 0.1898 ns | 0.2031 ns |  1.00 |    0.00 |
|           ThisPR |       32 |  5.212 ns | 0.1429 ns | 0.1755 ns |  0.68 |    0.03 |
| ThisPR_ExitEarly |       32 |  4.647 ns | 0.1311 ns | 0.1962 ns |  0.60 |    0.03 |
|                  |          |           |           |           |       |         |
|          Default |      100 | 13.635 ns | 0.3478 ns | 0.4875 ns |  1.00 |    0.00 |
|           ThisPR |      100 | 11.991 ns | 0.2784 ns | 0.3419 ns |  0.88 |    0.04 |
| ThisPR_ExitEarly |      100 | 10.712 ns | 0.2825 ns | 0.4228 ns |  0.79 |    0.04 |
Further benchmarks

GetHeaderName

StringCreate is used in this PR.
NewString is a variant that stackallocs and then calls new string(Span<char>), as it avoids the delegate call. But it's slower here.
Default_withImprovedStringUtitlites is the default version (master branch), but with the improved TryGetAsciiString method.

|                              Method |     Mean |    Error |   StdDev | Ratio | RatioSD |
|------------------------------------ |---------:|---------:|---------:|------:|--------:|
|                             Default | 20.60 ns | 0.508 ns | 0.978 ns |  1.00 |    0.00 |
| Default_withImprovedStringUtitlites | 17.74 ns | 0.173 ns | 0.144 ns |  0.89 |    0.04 |
|                        StringCreate | 16.43 ns | 0.263 ns | 0.246 ns |  0.81 |    0.05 |
|                           NewString | 28.28 ns | 0.237 ns | 0.222 ns |  1.40 |    0.08 |

GetAsciiStringNonNullCharactersBenchmark

StringCreate is used in this PR.
Default_withImprovedStringUtitlites is the default version (master branch), but with the improved TryGetAsciiString method.

|                              Method | BytesLen |     Mean |    Error |   StdDev | Ratio | RatioSD |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------------------------ |--------- |---------:|---------:|---------:|------:|--------:|-------:|------:|------:|----------:|
|                             Default |        4 | 14.98 ns | 0.376 ns | 0.334 ns |  1.00 |    0.00 | 0.0102 |     - |     - |      32 B |
| Default_withImprovedStringUtitlites |        4 | 14.84 ns | 0.269 ns | 0.252 ns |  0.99 |    0.03 | 0.0102 |     - |     - |      32 B |
|                        StringCreate |        4 | 14.96 ns | 0.360 ns | 0.336 ns |  1.00 |    0.04 | 0.0102 |     - |     - |      32 B |
|                                     |          |          |          |          |       |         |        |       |       |           |
|                             Default |        8 | 16.86 ns | 0.563 ns | 0.860 ns |  1.00 |    0.00 | 0.0127 |     - |     - |      40 B |
| Default_withImprovedStringUtitlites |        8 | 16.12 ns | 0.229 ns | 0.203 ns |  0.94 |    0.05 | 0.0127 |     - |     - |      40 B |
|                        StringCreate |        8 | 15.59 ns | 0.376 ns | 0.369 ns |  0.91 |    0.05 | 0.0127 |     - |     - |      40 B |
|                                     |          |          |          |          |       |         |        |       |       |           |
|                             Default |       16 | 22.27 ns | 0.493 ns | 0.462 ns |  1.00 |    0.00 | 0.0178 |     - |     - |      56 B |
| Default_withImprovedStringUtitlites |       16 | 15.77 ns | 0.222 ns | 0.197 ns |  0.71 |    0.02 | 0.0178 |     - |     - |      56 B |
|                        StringCreate |       16 | 15.39 ns | 0.396 ns | 0.604 ns |  0.70 |    0.03 | 0.0179 |     - |     - |      56 B |
|                                     |          |          |          |          |       |         |        |       |       |           |
|                             Default |       32 | 18.00 ns | 0.449 ns | 0.699 ns |  1.00 |    0.00 | 0.0280 |     - |     - |      88 B |
| Default_withImprovedStringUtitlites |       32 | 18.59 ns | 0.457 ns | 0.813 ns |  1.04 |    0.06 | 0.0280 |     - |     - |      88 B |
|                        StringCreate |       32 | 17.29 ns | 0.232 ns | 0.206 ns |  0.95 |    0.05 | 0.0280 |     - |     - |      88 B |
|                                     |          |          |          |          |       |         |        |       |       |           |
|                             Default |      100 | 36.13 ns | 0.814 ns | 1.058 ns |  1.00 |    0.00 | 0.0714 |     - |     - |     224 B |
| Default_withImprovedStringUtitlites |      100 | 32.14 ns | 0.773 ns | 0.645 ns |  0.89 |    0.03 | 0.0714 |     - |     - |     224 B |
|                        StringCreate |      100 | 32.01 ns | 0.834 ns | 0.739 ns |  0.89 |    0.03 | 0.0714 |     - |     - |     224 B |

GetAsciiOrUTF8StringNonNullCharactersBenchmark

StringCreate is used in this PR.
Unfortunately BenchmarkDotNet trims the TestData-string so that the cases aren't clearly seen.
But each "set" of a given length has

  • one pure ASCII string
  • ü inserted at end
  • ü inserted at index 0

|       Method |             TestData |      Mean |    Error |   StdDev |    Median | Ratio | RatioSD |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------- |--------------------- |----------:|---------:|---------:|----------:|------:|--------:|-------:|------:|------:|----------:|
|      Default |  (100(...)Rbi) [107] |  41.49 ns | 1.373 ns | 4.048 ns |  40.88 ns |  1.00 |    0.00 | 0.0714 |     - |     - |     224 B |
| StringCreate |  (100(...)Rbi) [107] |  33.98 ns | 0.842 ns | 1.064 ns |  33.77 ns |  0.76 |    0.06 | 0.0713 |     - |     - |     224 B |
|              |                      |           |          |          |           |       |         |        |       |       |           |
|      Default |  (100(...)Rbü) [107] | 124.54 ns | 2.500 ns | 2.778 ns | 124.52 ns |  1.00 |    0.00 | 0.1426 |     - |     - |     448 B |
| StringCreate |  (100(...)Rbü) [107] | 120.30 ns | 2.202 ns | 2.059 ns | 120.69 ns |  0.96 |    0.01 | 0.1426 |     - |     - |     448 B |
|              |                      |           |          |          |           |       |         |        |       |       |           |
|      Default |  (100(...)Rbi) [107] | 141.71 ns | 2.840 ns | 3.888 ns | 141.79 ns |  1.00 |    0.00 | 0.1428 |     - |     - |     448 B |
| StringCreate |  (100(...)Rbi) [107] | 127.49 ns | 2.496 ns | 2.335 ns | 127.87 ns |  0.89 |    0.02 | 0.1426 |     - |     - |     448 B |
|              |                      |           |          |          |           |       |         |        |       |       |           |
|      Default | (16, (...)Ox2I) [22] |  21.63 ns | 0.523 ns | 1.534 ns |  21.39 ns |  1.00 |    0.00 | 0.0178 |     - |     - |      56 B |
| StringCreate | (16, (...)Ox2I) [22] |  18.74 ns | 0.322 ns | 0.269 ns |  18.58 ns |  0.90 |    0.06 | 0.0178 |     - |     - |      56 B |
|              |                      |           |          |          |           |       |         |        |       |       |           |
|      Default | (16, (...)Ox2ü) [22] |  67.98 ns | 0.529 ns | 0.442 ns |  67.98 ns |  1.00 |    0.00 | 0.0356 |     - |     - |     112 B |
| StringCreate | (16, (...)Ox2ü) [22] |  62.43 ns | 1.324 ns | 1.301 ns |  62.16 ns |  0.92 |    0.02 | 0.0356 |     - |     - |     112 B |
|              |                      |           |          |          |           |       |         |        |       |       |           |
|      Default | (16, (...)Ox2I) [22] |  72.87 ns | 1.128 ns | 1.055 ns |  72.39 ns |  1.00 |    0.00 | 0.0356 |     - |     - |     112 B |
| StringCreate | (16, (...)Ox2I) [22] |  68.87 ns | 0.831 ns | 0.777 ns |  68.99 ns |  0.95 |    0.02 | 0.0356 |     - |     - |     112 B |
|              |                      |           |          |          |           |       |         |        |       |       |           |
|      Default | (32, (...)3qHC) [38] |  19.29 ns | 0.236 ns | 0.197 ns |  19.30 ns |  1.00 |    0.00 | 0.0280 |     - |     - |      88 B |
| StringCreate | (32, (...)3qHC) [38] |  19.61 ns | 0.165 ns | 0.137 ns |  19.59 ns |  1.02 |    0.01 | 0.0280 |     - |     - |      88 B |
|              |                      |           |          |          |           |       |         |        |       |       |           |
|      Default | (32, (...)3qHü) [38] |  82.47 ns | 1.663 ns | 3.754 ns |  83.62 ns |  1.00 |    0.00 | 0.0560 |     - |     - |     176 B |
| StringCreate | (32, (...)3qHü) [38] |  85.11 ns | 0.820 ns | 0.767 ns |  85.24 ns |  1.10 |    0.07 | 0.0560 |     - |     - |     176 B |
|              |                      |           |          |          |           |       |         |        |       |       |           |
|      Default | (32, (...)3qHC) [38] |  86.02 ns | 1.814 ns | 2.770 ns |  85.59 ns |  1.00 |    0.00 | 0.0560 |     - |     - |     176 B |
| StringCreate | (32, (...)3qHC) [38] |  81.59 ns | 0.735 ns | 0.687 ns |  81.44 ns |  0.93 |    0.03 | 0.0560 |     - |     - |     176 B |
|              |                      |           |          |          |           |       |         |        |       |       |           |
|      Default |            (4, F87w) |  16.24 ns | 0.314 ns | 0.245 ns |  16.28 ns |  1.00 |    0.00 | 0.0102 |     - |     - |      32 B |
| StringCreate |            (4, F87w) |  16.67 ns | 0.185 ns | 0.173 ns |  16.61 ns |  1.03 |    0.02 | 0.0102 |     - |     - |      32 B |
|              |                      |           |          |          |           |       |         |        |       |       |           |
|      Default |            (4, F87ü) |  57.34 ns | 0.571 ns | 0.446 ns |  57.22 ns |  1.00 |    0.00 | 0.0203 |     - |     - |      64 B |
| StringCreate |            (4, F87ü) |  56.60 ns | 0.455 ns | 0.426 ns |  56.51 ns |  0.99 |    0.01 | 0.0203 |     - |     - |      64 B |
|              |                      |           |          |          |           |       |         |        |       |       |           |
|      Default |            (4, ü87w) |  55.64 ns | 1.183 ns | 1.362 ns |  55.29 ns |  1.00 |    0.00 | 0.0203 |     - |     - |      64 B |
| StringCreate |            (4, ü87w) |  59.80 ns | 1.296 ns | 1.774 ns |  59.06 ns |  1.08 |    0.02 | 0.0203 |     - |     - |      64 B |
|              |                      |           |          |          |           |       |         |        |       |       |           |
|      Default |        (8, agIvaLef) |  16.25 ns | 0.427 ns | 1.110 ns |  15.64 ns |  1.00 |    0.00 | 0.0127 |     - |     - |      40 B |
| StringCreate |        (8, agIvaLef) |  17.17 ns | 0.580 ns | 0.691 ns |  16.97 ns |  0.97 |    0.06 | 0.0127 |     - |     - |      40 B |
|              |                      |           |          |          |           |       |         |        |       |       |           |
|      Default |        (8, agIvaLeü) |  60.83 ns | 0.884 ns | 0.783 ns |  60.70 ns |  1.00 |    0.00 | 0.0254 |     - |     - |      80 B |
| StringCreate |        (8, agIvaLeü) |  60.70 ns | 0.631 ns | 0.590 ns |  60.76 ns |  1.00 |    0.02 | 0.0254 |     - |     - |      80 B |
|              |                      |           |          |          |           |       |         |        |       |       |           |
|      Default |        (8, ügIvaLef) |  63.27 ns | 1.323 ns | 1.237 ns |  63.22 ns |  1.00 |    0.00 | 0.0254 |     - |     - |      80 B |
| StringCreate |        (8, ügIvaLef) |  62.51 ns | 0.580 ns | 0.542 ns |  62.53 ns |  0.99 |    0.02 | 0.0254 |     - |     - |      80 B |

gfoidl (Member Author) left a comment

Some notes for review.

@@ -86,52 +88,65 @@ private static unsafe ulong GetMaskAsLong(byte[] bytes)
}

// The same as GetAsciiStringNonNullCharacters but throws BadRequest
[MethodImpl(MethodImplOptions.AggressiveInlining)]
Member Author

I think it's OK to inline these methods, as in effect it's just the delegate invocation.

var resultString = string.Create(span.Length, new IntPtr(source), s_getAsciiOrUtf8StringNonNullCharacters);

// If resultString is marked, perform UTF-8 encoding
if (resultString[0] == '\0')
Member Author

See comment below.

if (!StringUtilities.TryGetAsciiString((byte*)state.ToPointer(), output, buffer.Length))
{
// Mark resultString for UTF-8 encoding
output[0] = '\0';
Member Author

I didn't find a better way to communicate back that non-ASCII data is present, except by using exceptions, but this case is not exceptional, and exceptions would be slow and a misuse.

That's why the string is marked with a "flag", which is checked above at L167.
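The flag approach described in this comment can be sketched as follows. This is a simplified illustration, not the code from the PR: it passes a byte[] as the string.Create state instead of an IntPtr so that it stays in safe code, and the names FlaggedDecodeSketch, WidenOrMark and GetAsciiOrUtf8String are hypothetical. The marker is only unambiguous because a successfully widened string is assumed never to contain '\0'. Error handling for invalid UTF-8 or embedded nulls is left out.

```csharp
using System;
using System.Buffers;
using System.Text;

public static class FlaggedDecodeSketch
{
    // Cached once so every call reuses the same delegate instance.
    private static readonly SpanAction<char, byte[]> s_widenOrMark = WidenOrMark;

    private static void WidenOrMark(Span<char> output, byte[] source)
    {
        for (int i = 0; i < source.Length; i++)
        {
            byte b = source[i];
            if (b == 0 || b > 0x7F)
            {
                // Mark the result: '\0' cannot occur in a successfully widened string.
                output[0] = '\0';
                return;
            }
            output[i] = (char)b;
        }
    }

    public static string GetAsciiOrUtf8String(byte[] source)
    {
        if (source.Length == 0)
        {
            return string.Empty;
        }

        string result = string.Create(source.Length, source, s_widenOrMark);

        // If the result is marked, decode the bytes as UTF-8 instead.
        return result[0] == '\0' ? Encoding.UTF8.GetString(source) : result;
    }
}
```

In this sketch the fallback allocates a second string, which would be consistent with the doubled Allocated values for the ü inputs in the tables above.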

@@ -17,102 +18,208 @@ internal class StringUtilities
[MethodImpl(MethodImplOptions.AggressiveOptimization)]
public static unsafe bool TryGetAsciiString(byte* input, char* output, int count)
Member Author

I'm aware of "no further unsafe code in Kestrel", but for code that uses hardware intrinsics there is nearly no other way. Therefore I left the signature of this method as is.

{
// If Vector not-accelerated or remaining less than vector size
if (!Vector.IsHardwareAccelerated || input > end - Vector<sbyte>.Count)
if (Avx2.IsSupported && input <= end - Vector256<sbyte>.Count)
Member Author

I did some checks with a higher threshold for AVX2 kicking in, but the results were not so good. Hence AVX2 kicks in as soon as there are enough elements.

}
else
{
throw new PlatformNotSupportedException("should not be here");
Member Author

Will never be reached, but it indicates that if Arm / Neon intrinsics become available, the check may be expanded here. The JIT eliminates this branch anyway.

}
else
{
throw new PlatformNotSupportedException("should not be here");
Member Author

Same as above for AVX2.

halter73 (Member) commented Dec 4, 2019

Did using string.Create cause slowdown that was mitigated by the new usage of hardware intrinsics?

gfoidl (Member Author) commented Dec 4, 2019

Did using string.Create cause slowdown

Yes. Due to the overhead of the delegate call.
There is nothing to improve here, except when Static Delegates come up.

that was mitigated by the new usage of hardware intrinsics?

Not really.
Look at this benchmark where Default_withImprovedStringUtitlites is the default implementation but with the new usage of hardware intrinsics. It shows a clear improvement.
StringCreate also uses the new hardware intrinsics and is even faster. The reasons are

  • the delegates are cached in static readonly fields, which saves quite a lot of the C# compiler's delegate-handling code (i.e. null check, initialization)
  • thus the methods are reduced in effect (from a JIT point of view) to a check for an empty span and the cached delegate invocation
  • so these methods can be inlined (with a little bit of forcing, as the JIT won't do it on its own, but the asm code is not really more than a regular call would be with all the setup)

I would not force the inlining of the default implementations, as they contain more code and would result in bigger call sites.
With the string.Create approach it's possible, as the "work code" is moved to the delegate (which in turn is more or less a call to TryGetAsciiString, which could be inlined into the delegate method, but I didn't do this to keep the method general and keep it from being inlined in all other (potential) usages).
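To make the difference between the two caching styles concrete, here is a small sketch. The names CachedDelegateSketch, Widen, ViaLambda and ViaCachedField are hypothetical, and a byte[] is used as state instead of the IntPtr used in the PR so the example stays in safe code. For a non-capturing lambda the C# compiler emits a lazily initialized cache field that is null-checked at the call site; the static readonly field is initialized once in the type initializer, so the call site only loads it.

```csharp
using System;
using System.Buffers;

public static class CachedDelegateSketch
{
    // Explicit cache: created once, no null check at the call site.
    private static readonly SpanAction<char, byte[]> s_widen = Widen;

    private static void Widen(Span<char> output, byte[] source)
    {
        for (int i = 0; i < source.Length; i++)
        {
            output[i] = (char)source[i];
        }
    }

    // Variant 1: lambda at the call site. The compiler caches the delegate
    // in a hidden static field and checks that field for null on each call.
    public static string ViaLambda(byte[] source)
        => string.Create(source.Length, source, (span, state) => Widen(span, state));

    // Variant 2: the style described above. The method body is little more
    // than loading the field and calling string.Create.
    public static string ViaCachedField(byte[] source)
        => string.Create(source.Length, source, s_widen);
}
```

Both variants return the same string; only the generated call-site code differs, which is what the benchmarks in this PR measure.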

This reverts commit 5ae80c2834471baf34d1e5a05a42e3cce1ff02d7.

This is a .NET Standard 2.0 project, so no span is available by default. I think it's not worth it to add a reference to the System.Memory package just for this change.
…ing a null check for the compiler generated cached delegate
gfoidl (Member Author) commented Jan 5, 2020

Rebased due to conflicts in

src/Servers/Kestrel/Core/src/Internal/Infrastructure/StringUtilities.cs
src/Servers/Kestrel/samples/http2cat/Http2Utilities.cs 

{
// If Vector not-accelerated or remaining less than vector size
if (!Vector.IsHardwareAccelerated || input > end - Vector<sbyte>.Count)
if (Avx2.IsSupported && input <= end - Vector256<sbyte>.Count)
benaadams (Member), Jan 17, 2020

pointers are unsigned and end - Vector256<sbyte>.Count can't go negative; so

byte* p = null;
if (p <= p - 15)
{
    Console.WriteLine(true);
}

Will output true

So it's better to switch the subtraction around with a <= test and have input overflow if it goes negative, e.g.

- input <= end - Vector256<sbyte>.Count
+ input - Vector256<sbyte>.Count <= end

Member Author

Hm, this condition won't work.
E.g. this situation is valid and could be processed by the AVX code path:

byte* input = 0;
byte* end = 32;

if (input - Vector256<sbyte>.Count <= end)
{
    // ...
}

Here it will wrap around to 0xffffffffffffffe0 <= 32 which is false.

So it should be input + Vector256<sbyte>.Count <= end which itself can overflow, hence the condition is written with the subtraction.

Your concern is valid, so I could add

public static unsafe bool TryGetAsciiString(byte* input, char* output, int count)
{
+   Debug.Assert(input != null);
+   Debug.Assert(output != null);

?

Member

Or just count >= Vector256<sbyte>.Count?

Member Author

That's the easy way 😃

In

} while (input <= end - Vector256<sbyte>.Count);
the same condition is used, and the JIT will use the value from a register. I'll need to check the codegen to see whether the loop condition is still kept in a register.

And in the lines below -- either for a different code path or to process the remainder -- similar conditions are used. To use count here, this value needs bookkeeping.

As this is a Try...-method I think it's better to just return false in case input == null.

Member

Is it provably impossible for end < Vector256<sbyte>.Count if we know it's not null? I know we were already doing end - Vector<sbyte>.Count before, but I'm curious.

Member Author

In theory end < Vector256<sbyte>.Count can be true, but not in practice.
Even if count = 0 then end will be the same as input and that pointer won't reside at addresses lower than 32. Hence there is no real problem.
As it's internal I'll add another assert for this.
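The wrap-around cases discussed in this thread can be reproduced without unsafe code by modelling addresses as ulong values. This is only an illustration under that assumption; PointerBoundsSketch and its method names are hypothetical, and 32 stands in for Vector256<sbyte>.Count.

```csharp
using System;

public static class PointerBoundsSketch
{
    private const ulong VectorSize = 32; // Vector256<sbyte>.Count

    // Condition used in the PR: wraps around when end < VectorSize.
    public static bool EndMinusSize(ulong input, ulong end)
        => input <= unchecked(end - VectorSize);

    // Swapped form from the review: wraps around when input < VectorSize.
    public static bool InputMinusSize(ulong input, ulong end)
        => unchecked(input - VectorSize) <= end;

    // Count-based form: no address arithmetic, so nothing can wrap.
    public static bool CountBased(int count)
        => count >= (int)VectorSize;
}
```

With input = 0 and end = 0, EndMinusSize returns true even though no data is available, which is the null-pointer concern raised above. With input = 0 and end = 32, InputMinusSize returns false even though a full vector is available. For addresses well above 32, EndMinusSize behaves as intended.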

@@ -418,6 +529,36 @@ private static bool CheckBytesInAsciiRange(Vector<sbyte> check)
return Vector.GreaterThanAll(check, Vector<sbyte>.Zero);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static bool CheckBytesInAsciiRange(Vector256<sbyte> check, Vector256<sbyte> zero)
Member

Out of curiosity, why is the zero vector passed in as an argument in these methods?

Member Author

So that the "zero register" can be reused. Otherwise the JIT won't reuse that register and xors a register again.
Although zeroing (xor) a register has no latency, the CPU front end must still fetch and decode the instruction, so it has some cost, which can be avoided this way.
Note: this works only because this method will be inlined.
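A minimal sketch of the pattern, using the 128-bit SSE2 variant: the caller creates the zero vector once and passes it to the inlined check on every iteration. The names ZeroRegisterSketch, InAsciiRange and AllInAsciiRange are hypothetical, the scalar fallback exists only so the sketch also runs on hardware without SSE2, and whether the JIT really keeps the zero vector in one register depends on the runtime version.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class ZeroRegisterSketch
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static bool InAsciiRange(Vector128<sbyte> check, Vector128<sbyte> zero)
    {
        // Bytes 0x01..0x7F are greater than zero as signed values;
        // 0x00 and 0x80..0xFF are not.
        Vector128<sbyte> mask = Sse2.CompareGreaterThan(check, zero);
        return Sse2.MoveMask(mask) == 0xFFFF;
    }

    public static bool AllInAsciiRange(ReadOnlySpan<Vector128<sbyte>> blocks)
    {
        if (Sse2.IsSupported)
        {
            // Created once outside the loop and handed to the inlined helper.
            Vector128<sbyte> zero = Vector128<sbyte>.Zero;
            foreach (Vector128<sbyte> block in blocks)
            {
                if (!InAsciiRange(block, zero))
                {
                    return false;
                }
            }
            return true;
        }

        // Scalar fallback, only for portability of this sketch.
        foreach (Vector128<sbyte> block in blocks)
        {
            for (int i = 0; i < Vector128<sbyte>.Count; i++)
            {
                if (block.GetElement(i) <= 0)
                {
                    return false;
                }
            }
        }
        return true;
    }
}
```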

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static bool CheckBytesInAsciiRange(Vector256<sbyte> check, Vector256<sbyte> zero)
{
if (Avx2.IsSupported)
Member

It's not actually possible for this to be called without Avx2.IsSupported being true, right?

Member Author

Correct.

Right now I ask myself why I didn't go with a Debug.Assert to make this obvious.
There must be some other code pattern that influenced me to write it that way, but I can't remember.

Also why this method is public and not private.

I'll update this to be simpler.

Member Author

Ahhh, I didn't remember my own comment 😉

#17556 (comment)
That's why it's written that way.

Which way to go? With Debug.Assert or leave as is (plus add a comment explaining this).

public static unsafe string GetHeaderName(this ReadOnlySpan<byte> span)
{
if (span.IsEmpty)
{
return string.Empty;
}

var asciiString = new string('\0', span.Length);
fixed (byte* source = &MemoryMarshal.GetReference(span))
Member

Does this perform better than fixed (byte* buffer = span) used previously?

gfoidl (Member Author), Feb 9, 2020

Yes (it saves one length == 0 check).
GetPinnableReference needs to check for an empty span. That check is already done above, so we know the span is non-empty and can skip the check by using MemoryMarshal.GetReference.


fixed (char* output = asciiString)
fixed (byte* buffer = span)
private static readonly SpanAction<char, IntPtr> s_getHeaderName = GetHeaderName;
Member

Nit: can we move the static field to the top of the class and start it with _ for consistency? I know the convention is different in the runtime repo.

Member Author

Done -- will be updated with the next push.


fixed (char* output = asciiString)
fixed (byte* buffer = span)
private static readonly SpanAction<char, IntPtr> s_getAsciiStringNonNullCharacters = GetAsciiStringNonNullCharacters;
Member

ditto for s_getHeaderName

Member Author

Done -- will be updated with the next push.

benaadams (Member) commented Feb 7, 2020

/cc @Maoni0 @GrabYourPitchforks

output += Vector128<sbyte>.Count;
} while (input <= end - Vector128<sbyte>.Count);

if (input == end)
halter73 (Member), Feb 7, 2020

This only helps if the decoded string happens to have a multiple of 16 characters, right? Seems unlikely to be worth it on average.

Member

A single check avoids the 4-check chain: <= sizeof(long), <= sizeof(int), <= sizeof(char), <= 1; though I suppose it adds one more check to everything else.

Member Author

Yes that's right. I tend to keep it, as @benaadams explained, and I think it doesn't hurt for input lengths that are not multiples of 16 chars.

return false;
}

if (Bmi2.X64.IsSupported)
Member Author

@GrabYourPitchforks with respect to dotnet/runtime#2251 and your PR dotnet/runtime#31904 should Bmi2 be removed here too?
I don't know what's the current stance on this.

Member

Yes, BMI2 should be removed if possible. For AMD processors this PR is likely to introduce similar regressions as those we are fixing via dotnet/runtime#2251.

Member Author

-> #18949 to track this.

ghost commented Feb 10, 2020

Hello @halter73!

Because this pull request has the auto-merge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

@ghost ghost merged commit aa7804c into dotnet:master Feb 10, 2020
halter73 (Member)

Thanks @gfoidl!

@gfoidl gfoidl deleted the httputilities_getheadername branch February 11, 2020 09:36
@amcasey amcasey added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Jun 6, 2023
This pull request was closed.