
Improved HTTP/1 header parser #63295

Closed
wants to merge 1 commit into from

Conversation

scalablecory
Contributor

This replaces the SocketsHttpHandler header parser with two implementations: an AVX2 one and a portable one. It ports over work done as part of the LLHTTP project.

This gives a 5-10% end-to-end perf improvement over localhost:

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality |
|---|---|---|---|---|
| System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get(ssl: False, chunkedResponse | 1.09 | 34244.28 | 31391.45 | |
| System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get(ssl: False, chunkedResponse | 1.08 | 33410.35 | 30836.15 | |
| System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get_EnumerateHeaders_Validated( | 1.08 | 34049.97 | 31524.56 | |
| System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get_EnumerateHeaders_Unvalidate | 1.07 | 37427.29 | 34867.46 | several? |

The AVX2 parser is 50-100% faster than the portable implementation. It is faster in every situation (few headers, many headers, small headers, large headers).

Method HeaderCount HeaderNameSize HeaderValueSize Mean Error StdDev
PortableHeaders 1 1 1 14.20 ns 0.078 ns 0.073 ns
Avx2Headers 1 1 1 10.50 ns 0.094 ns 0.088 ns
PortableHeaders 1 1 8 15.06 ns 0.096 ns 0.090 ns
Avx2Headers 1 1 8 10.48 ns 0.069 ns 0.065 ns
PortableHeaders 1 1 128 18.05 ns 0.089 ns 0.084 ns
Avx2Headers 1 1 128 12.20 ns 0.112 ns 0.105 ns
PortableHeaders 1 8 1 14.26 ns 0.087 ns 0.081 ns
Avx2Headers 1 8 1 10.49 ns 0.080 ns 0.071 ns
PortableHeaders 1 8 8 14.83 ns 0.074 ns 0.065 ns
Avx2Headers 1 8 8 10.52 ns 0.073 ns 0.069 ns
PortableHeaders 1 8 128 16.38 ns 0.103 ns 0.091 ns
Avx2Headers 1 8 128 12.73 ns 0.075 ns 0.066 ns
PortableHeaders 1 16 1 14.15 ns 0.129 ns 0.108 ns
Avx2Headers 1 16 1 10.67 ns 0.115 ns 0.096 ns
PortableHeaders 1 16 8 15.27 ns 0.227 ns 0.201 ns
Avx2Headers 1 16 8 10.69 ns 0.115 ns 0.108 ns
PortableHeaders 1 16 128 18.44 ns 0.143 ns 0.133 ns
Avx2Headers 1 16 128 13.01 ns 0.243 ns 0.227 ns
PortableHeaders 8 1 1 56.03 ns 0.711 ns 0.665 ns
Avx2Headers 8 1 1 29.64 ns 0.430 ns 0.403 ns
PortableHeaders 8 1 8 67.30 ns 0.311 ns 0.275 ns
Avx2Headers 8 1 8 31.10 ns 0.180 ns 0.151 ns
PortableHeaders 8 1 128 82.21 ns 0.448 ns 0.397 ns
Avx2Headers 8 1 128 50.41 ns 0.593 ns 0.555 ns
PortableHeaders 8 8 1 58.42 ns 1.005 ns 0.940 ns
Avx2Headers 8 8 1 30.94 ns 0.189 ns 0.176 ns
PortableHeaders 8 8 8 72.81 ns 0.912 ns 0.853 ns
Avx2Headers 8 8 8 31.56 ns 0.133 ns 0.124 ns
PortableHeaders 8 8 128 86.82 ns 0.714 ns 0.633 ns
Avx2Headers 8 8 128 51.55 ns 0.286 ns 0.267 ns
PortableHeaders 8 16 1 55.19 ns 0.160 ns 0.149 ns
Avx2Headers 8 16 1 32.95 ns 0.297 ns 0.278 ns
PortableHeaders 8 16 8 74.59 ns 0.375 ns 0.351 ns
Avx2Headers 8 16 8 34.22 ns 0.121 ns 0.108 ns
PortableHeaders 8 16 128 84.59 ns 0.311 ns 0.276 ns
Avx2Headers 8 16 128 51.96 ns 0.298 ns 0.279 ns
PortableHeaders 16 1 1 119.45 ns 0.736 ns 0.688 ns
Avx2Headers 16 1 1 52.34 ns 0.367 ns 0.306 ns
PortableHeaders 16 1 8 134.94 ns 0.466 ns 0.436 ns
Avx2Headers 16 1 8 53.98 ns 0.322 ns 0.301 ns
PortableHeaders 16 1 128 171.17 ns 1.379 ns 1.290 ns
Avx2Headers 16 1 128 100.83 ns 0.502 ns 0.445 ns
PortableHeaders 16 8 1 109.07 ns 0.498 ns 0.465 ns
Avx2Headers 16 8 1 53.92 ns 0.129 ns 0.120 ns
PortableHeaders 16 8 8 140.69 ns 0.591 ns 0.553 ns
Avx2Headers 16 8 8 56.92 ns 0.243 ns 0.203 ns
PortableHeaders 16 8 128 169.72 ns 0.592 ns 0.494 ns
Avx2Headers 16 8 128 99.44 ns 0.401 ns 0.375 ns
PortableHeaders 16 16 1 114.38 ns 0.338 ns 0.282 ns
Avx2Headers 16 16 1 56.66 ns 0.152 ns 0.142 ns
PortableHeaders 16 16 8 142.28 ns 0.481 ns 0.402 ns
Avx2Headers 16 16 8 60.48 ns 0.328 ns 0.306 ns
PortableHeaders 16 16 128 167.60 ns 0.416 ns 0.389 ns
Avx2Headers 16 16 128 108.95 ns 0.446 ns 0.417 ns

@ghost

ghost commented Jan 3, 2022

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Author: scalablecory
Assignees: -
Labels: area-System.Net.Http
Milestone: -

/// This is done via <see cref="SimdPaddedArray"/>.
/// It also assumes that it can always step backwards to a 32-byte aligned address.
/// </remarks>
private unsafe (bool finished, int bytesConsumed) ParseHeadersAvx2(Span<byte> buffer, HttpResponseMessage? response, bool isFromTrailer)
Member

I think this should be changed to be Vector128<T> based, possibly even waiting for #61649 to be merged and using the new xplat HWIntrinsic APIs.

We shouldn't be merging acceleration here, or anywhere else in the BCL, without also including ARM64 support, and that means using Vector128<T> as a first-class path.

There may be additional optimization opportunities involving Vector256<T> (and therefore AVX2, or eventually SVE on ARM), but we always need a Vector128<T> path on these APIs.

CC. @danmoseley

Member

Thanks for this work, @scalablecory! I agree with @tannergooding that we should nowadays consider Arm64 support essential. But doing it explicitly would be wasted work. @tannergooding, how soon will Vector128<T> be usable?

Member

(It'd also be really helpful if, as part of making it fully usable, we aggressively fixed up all current usage of intrinsics in the repo to use Vector128. I believe that's part of the plan, just making sure. Then subsequent usage can use that as a guide.)

Member

we aggressively fixed up all current usage of intrinsics in the repo to use Vector128. I believe that's part of the plan, just making sure.

Right, that is the current plan. Vector128<T> is the "common" surface area across all currently supported platforms (x86, x64, ARM64) and will likewise be supported on future platforms if we add them (WASM, ARM32, PowerPC, etc.).
-- A large reason for this is that Vector128<float> is the natural type for graphical and multimedia applications, so almost all CPUs add some sort of SIMD support for 128-bit vectors.

There may still be cases where having an additional path supporting Vector256<T> is beneficial, particularly if the workloads are known to be big. But given that ARM32/ARM64 don't support Vector256<T> (we could only emulate it as 2x Vector128<T> ops), Vector128<T> is likely the base type that all SIMD algorithms need to start with.
-- This also ensures platforms without AVX/AVX2 support are accelerated. x64 emulation on ARM64, for example, only supports up to SSE4.1 on both Windows and macOS. It also ensures low-power or budget CPUs (such as Intel Atom) are supported.

But doing it explicitly would be wasted work. @tannergooding, how soon will Vector128 be usable?

#61649 is the last "big" stepping stone and ensures Vector128<T>.IsHardwareAccelerated returns true. It's just pending review from @echesakovMSFT (whom I pinged this morning; they indicated they would try to get to it this week).
-- There are still a couple of additional PRs to go up after this, but they are mainly for adding nint/nuint support and a couple of additional APIs that were approved after the initial xplat API review, such as the Sum method.

Contributor Author

We shouldn't be merging acceleration here, or anywhere else in the BCL without also including ARM64 support and that means using Vector128<T> as a first class path.

Can you expand here? I can read this in three ways:

  • We have new cross-plat vector stuff that we think will be applicable here, but it isn't complete, and maybe we can use this as a test case.
  • We have red tape saying no platform-specific intrinsics without also having an ARM64 path. This seems like we'd be letting perfect get in the way of good -- I'm happy to revisit an ARM64 version in a separate issue, but delaying this change doesn't seem to add any value?
  • We intend to do away with platform-specific intrinsics in libraries so we shouldn't start adding more.

Member

We have red tape saying no platform-specific intrinsics without also having an ARM64 path.

I think this is the most applicable. ARM64, particularly for .NET 7, is a major top down focus. It is a first class platform and we should not be adding new code that accelerates x64 and skips out on accelerating ARM64.

We have new cross-plat vector stuff that we think will be applicable here, but it isn't complete,

The new cross-platform vector APIs will be applicable here. They're practically ready, just pending the last PR getting merged. The entire purpose of them is to help trivialize supporting ARM64 (and other non-x64 platforms), because we have a goal of ensuring they are also treated as "first class".

In most of the cases where we have SIMD code supporting both x64 and ARM64, the two code paths are nearly identical, often differing in only a couple of places, if anywhere at all. The new APIs allow you to write things like x + y rather than AdvSimd.Add(x, y) and Sse.Add(x, y). This allows most of the code to be shared.

There will still be a handful of places where there is no common functionality available or where it is advantageous to specialize on a given platform. For those scenarios, you then trivially switch from xplat intrinsics to hardware-specific intrinsics using if (AdvSimd.IsSupported) { /* ARM64 path */ } else if (Sse.IsSupported) { /* x86/x64 path */ } else { /* fallback path */ }. Part of the reason this is "trivial" is that you are already using Vector128<T>, and so you have shared SIMD logic; platform-specific SIMD logic; shared SIMD logic.

We intend to do away with platform-specific intrinsics in libraries so we shouldn't start adding more.

It won't go away, and there will still be places where there are platform-specific optimization opportunities available. But by default, most code adding SIMD support will have a Vector128<T> shared path that works for ARM64 and for small inputs on x86/x64. This same path will also allow implicit lightup on future platforms getting SIMD support, such as WASM.
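To make that concrete, here is a minimal sketch (not code from this PR) of what a shared Vector128<T> path can look like for one small piece of the problem -- finding the next '\n' -- written against the .NET 7 cross-platform Vector128 APIs discussed above (Vector128.LoadUnsafe, Vector128.Equals, Vector128.ExtractMostSignificantBits). The IndexOfLineFeed helper is invented purely for illustration:

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class Vector128Sketch
{
    // Finds the next '\n' using a shared Vector128<byte> path that lights up on
    // x64, ARM64, and any future platform with 128-bit SIMD, with a scalar fallback.
    public static int IndexOfLineFeed(ReadOnlySpan<byte> buffer)
    {
        int i = 0;

        if (Vector128.IsHardwareAccelerated && buffer.Length >= Vector128<byte>.Count)
        {
            Vector128<byte> lf = Vector128.Create((byte)'\n');
            ref byte start = ref MemoryMarshal.GetReference(buffer);

            for (; i <= buffer.Length - Vector128<byte>.Count; i += Vector128<byte>.Count)
            {
                // Shared SIMD logic: identical source for x64 and ARM64.
                Vector128<byte> matches = Vector128.Equals(Vector128.LoadUnsafe(ref start, (nuint)i), lf);
                uint mask = Vector128.ExtractMostSignificantBits(matches);
                if (mask != 0)
                {
                    return i + BitOperations.TrailingZeroCount(mask);
                }
            }
        }

        // Scalar path: the tail of the buffer, or platforms without SIMD acceleration.
        for (; i < buffer.Length; i++)
        {
            if (buffer[i] == (byte)'\n')
            {
                return i;
            }
        }

        return -1;
    }
}
```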

Contributor Author

Thank you for the clarification.

There will still be a handful of places where there is no common functionality available or where it is advantageous to specialize on a given platform. For those scenarios, you then trivially switch from xplat intrinsics to hardware-specific intrinsics using if (AdvSimd.IsSupported) { /* ARM64 path */ } else if (Sse.IsSupported) { /* x86/x64 path */ } else { /* fallback path */ }. Part of the reason this is "trivial" is that you are already using Vector128<T>, and so you have shared SIMD logic; platform-specific SIMD logic; shared SIMD logic.

This seems very reasonable.

We have red tape saying no platform-specific intrinsics without also having an ARM64 path.

I think this is the most applicable. ARM64, particularly for .NET 7, is a major top down focus. It is a first class platform and we should not be adding new code that accelerates x64 and skips out on accelerating ARM64.

Thanks, this makes sense. I agree fully with the goal but we need to work on how we word this guidance (for docs, and future PR feedback).

We shouldn't state it as a dogma of "no Vector256", but rather as something like: "here's how we want cross-plat vector code to work (see your great fallback example); PRs should prefer this where possible, and justify it if not."

@geoffkizer
Contributor

Re the "SimdPaddedArray" stuff: is there a standard/recommended way to do this? Seems like this would be a common problem for any SIMD user.

@geoffkizer
Contributor

The portable version uses span.IndexOf and span.IndexOfAny. My understanding is that these are already vector-optimized. Is that not happening here for some reason?

Or, is the hand-rolled vectorization more efficient for some reason, e.g. we are amortizing some costs across operations or something like that?

@scalablecory
Contributor Author

The portable version uses span.IndexOf and span.IndexOfAny. My understanding is that these are already vector-optimized. Is that not happening here for some reason?

Or, is the hand-rolled vectorization more efficient for some reason, e.g. we are amortizing some costs across operations or something like that?

Those are vectorized, but they throw away a lot of data that is useful for optimizing HTTP/1. In our case, 32 bytes will likely contain 2-3 headers, and this maintains (in a bitmask) all the interesting tokens to check. If we assume 3 headers, using IndexOf means 6 SIMD scans (for : and \n), while this would do just 1.

There is also a perf improvement from removing bounds checks and hand-optimizing for good codegen, but this was secondary.
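To illustrate (a hedged sketch, not the PR's code): one 32-byte load can produce bitmasks of every '\n' and ':' in the chunk, which are then walked with trailing-zero counts instead of issuing a fresh IndexOf/IndexOfAny scan per delimiter. The EnumerateDelimiters helper below is invented for illustration:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class MaskSketch
{
    // Caller guarantees Avx2.IsSupported and at least 32 readable bytes.
    public static unsafe void EnumerateDelimiters(byte* buffer)
    {
        Vector256<byte> chunk = Avx.LoadVector256(buffer);

        // One compare per delimiter over all 32 bytes; MoveMask packs each lane into one bit.
        uint lfMask  = (uint)Avx2.MoveMask(Avx2.CompareEqual(chunk, Vector256.Create((byte)'\n')));
        uint colMask = (uint)Avx2.MoveMask(Avx2.CompareEqual(chunk, Vector256.Create((byte)':')));

        // Walk every match: each step is a tzcnt plus clearing the lowest set bit,
        // rather than a brand-new vectorized scan as with repeated IndexOf calls.
        for (uint mask = lfMask | colMask; mask != 0; mask &= mask - 1)
        {
            int offset = BitOperations.TrailingZeroCount(mask);
            bool isLineFeed = (lfMask & (1u << offset)) != 0;
            Console.WriteLine($"{(isLineFeed ? "LF" : "colon")} at offset {offset}");
        }
    }
}
```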

@geoffkizer
Contributor

I wonder if it would be better to have "intro" and "outro" code to handle misaligned prefix/postfix data.

This seems like it has a couple potential benefits:
(1) Don't need SimdPaddedArray at all
(2) Seems like it would remove some special cases in the inner loop, e.g. artificial match prior to beginning or after end of buffer
(3) Code is more common between the different cases (no-vector / Vector128 / Vector256+AVX2)

Thoughts?


vector = Avx.LoadAlignedVector256(vectorIter);
foundLF = (uint)Avx2.MoveMask(Avx2.CompareEqual(vector, maskLF));
foundCol = foundLF | (uint)Avx2.MoveMask(Avx2.CompareEqual(vector, maskCol));
Contributor

It seems like this is unnecessary work in a fair number of cases -- i.e. if the header value extends beyond this vector, which seems pretty common. Would it make sense to defer this until we actually find the \n indicating the end of the header value?
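For illustration, here is a hedged sketch of a literal reading of that suggestion (not the PR's code; identifiers mirror the quoted fragment, and the ScanChunk wrapper is invented): only pay for the ':' compare in chunks that actually contain a '\n'.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class DeferredColonSketch
{
    public static unsafe (uint foundLF, uint foundCol) ScanChunk(byte* vectorIter, Vector256<byte> maskLF, Vector256<byte> maskCol)
    {
        Vector256<byte> vector = Avx.LoadAlignedVector256(vectorIter);
        uint foundLF = (uint)Avx2.MoveMask(Avx2.CompareEqual(vector, maskLF));
        uint foundCol = 0;

        if (foundLF != 0)
        {
            // A header ends somewhere in this chunk, so locating ':' is now worthwhile;
            // chunks that sit entirely inside a long header value skip this compare.
            foundCol = foundLF | (uint)Avx2.MoveMask(Avx2.CompareEqual(vector, maskCol));
        }

        return (foundLF, foundCol);
    }
}
```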

@scalablecory
Contributor Author

I wonder if it would be better to have "intro" and "outro" code to handle misaligned prefix/postfix data.

This seems like it has a couple potential benefits: (1) Don't need SimdPaddedArray at all (2) Seems like it would remove some special cases in the inner loop, e.g. artificial match prior to beginning or after end of buffer (3) Code is more common between different cases, meaning no-vector/vector128/vector256avx

Thoughts?

Intro/outro approach is a no-go -- I had a version that did this and it was much slower. I also had a version that used a load mask which was also slower.

In theory the code should never match past the buffer because that will be 0-initialized; let me check that this is the case. I can't remember why I put the check for that back in.

We could use unaligned loads, or pad the front of the buffer with 0s too, to potentially avoid some checks. What I found was that matches in the "pre-buffer" area were rare enough that I wasn't sure whether the tradeoff of a mostly perfectly predicted branch vs. the setup needed to remove the branch was worth it. I changed the code a bit since I tried this, though, so maybe it's worth another look.

@stephentoub
Member

stephentoub commented Jan 3, 2022

Those are vectorized, but they throw away a lot of data that is useful for optimizing HTTP/1. In our case, 32 bytes will likely contain 2-3 headers, and this maintains (in a bitmask) all the interesting tokens to check. If we assume 3 headers, using IndexOf means 6 SIMD scans (for : and \n), while this would do just 1.

Can you elaborate on this? Specifically:

  1. What is the data source for 32 bytes containing 2-3 headers on average? Just a non-scientific test, when I do a get request to bing.com, the average header length is 150 chars, for httpbin.org it's 31, etc.
  2. What do you mean about IndexOf doing 6 scans whereas this does only 1? IndexOfAny(char1, char2) similarly ors together a mask for each character; did you mean IndexOfAny doing 3 vs this doing only 1 (based on 3 headers fitting in 32 bytes)?

@scalablecory
Contributor Author

Those are vectorized, but they throw away a lot of data that is useful for optimizing HTTP/1. In our case, 32 bytes will likely contain 2-3 headers, and this maintains (in a bitmask) all the interesting tokens to check. If we assume 3 headers, using IndexOf means 6 SIMD scans (for : and \n), while this would do just 1.

Can you elaborate on this? Specifically:

  1. What is the data source for 32 bytes containing 2-3 headers on average? Just a non-scientific test, when I do a get request to bing.com, the average header length is 150 chars, for httpbin.org it's 31, etc.

Based on prior experience working with a wide variety of web APIs. Websites certainly tend to have much larger headers, which is why I benchmarked against that as well.

  2. What do you mean about IndexOf doing 6 scans whereas this does only 1? IndexOfAny(char1, char2) similarly ors together a mask for each character; did you mean IndexOfAny doing 3 vs this doing only 1 (based on 3 headers fitting in 32 bytes)?

Not IndexOf itself -- using IndexOf. For each header you have two operations:

colIdx = x.IndexOfAny(':', '\n'); // end of the header name (or a malformed/blank line)
lfIdx = x.IndexOf('\n');          // end of the header value

Three headers will mean six calls to IndexOf. If the headers all fit within 32 bytes, the code in this PR avoids repeating all of the work that IndexOf does ahead of its final tzcnt.

@stephentoub
Member

Based on prior experience working with a wide variety of web APIs. Websites certainly tend to have much larger headers, which is why I benchmarked against that as well.

Even the minimalist default web forecast microservice ASP.NET template has an average header length of 27. I'm not discounting prior experience, but if we're going to base decisions on this, it'd be helpful to see the concrete data informing it.

@scalablecory
Contributor Author

scalablecory commented Jan 3, 2022

Based on prior experience working with a wide variety of web APIs. Websites certainly tend to have much larger headers, which is why I benchmarked against that as well.

Even the minimalist default web forecast microservice ASP.NET template has an average length of 27. I'm not discounting prior experience, but if we're going to base decisions on this, it'd be helpful to see the concrete data impacting it.

Treat my anecdote as a scenario showing where IndexOf can be sub-optimal. It wasn't intended to distract, nor was it stated as a reason to merge this PR -- we can have benchmarks for that. If you'd like to see more benchmarks, I can do that.

@stephentoub
Member

stephentoub commented Jan 4, 2022

It wasn't intended to distract

I'm not distracted ;-) I'm trying to better understand the root of a cited ~35% improvement (in the case of the last line of the table). The example claimed the bulk of that improvement came from an ~6:1 reduction in calls to IndexOf{Any}. It sounds instead like in the typical case it might really be more like 2:1 -- essentially that we're saving the duplicated setup costs of IndexOfAny+IndexOf by effectively merging them together. But there are other changes in the PR as well, so it's not clear to me how much is coming from that versus from other changes.

And if it really is all coming from the vectorization, that concerns me as well: there's a lot of complexity being added here, and intrinsics/vectorized code is orders of magnitude harder to maintain. While System.Net.Http is a little special, it's still representative of fairly typical library code, and I don't want the majority of such devs to need to replace simple IndexOfAny+IndexOf code with hand-tuned vectorized code in order to gain an additional 35%.

Which then leads me to ask: if that really is the source of the gain, how can we do better? Can we push any improvements down into IndexOfAny/IndexOf? Can we add a new variant that encapsulates this common key/value pair parsing pattern, for SocketsHttpHandler to use but also for anyone else doing such parsing (e.g. Utf8Parser.TryParseKeyValuePair(byte separator, byte end))? Etc.

@scalablecory
Contributor Author

I'm going to close this PR and split it up.

AVX is not the bulk of the improvement here, so the portable change will be a no-brainer.

I'll also open up an issue detailing the AVX parser to maybe spark some ideas on how we make parsing better in .NET.

@stephentoub
Member

AVX is not the bulk of the improvement here, so the portable change will be a no-brainer.

Great. Thanks.

@geoffkizer
Contributor

AVX is not the bulk of the improvement here, so the portable change will be a no-brainer.

Given this, we should definitely split this up. Let's harvest non-vector improvements first and then we can focus on if/how vectorization improves this.

What are the main sources of improvement here?

I'm all for exploring using vector ops better in parsing -- it seems entirely possible that there could be some wins here. But I'll echo a couple points from @stephentoub above:
(1) "intrinsics/vectorized code is orders of magnitude harder to maintain" -- so we need to be super clear about what the win is from custom vectorization in order to evaluate whether we should do it;
(2) The kind of parsing done here is pretty common to do -- likely any text format would have similar characteristics, and many may actually be more complex and potentially benefit from vectorization more than HTTP/1.1. So we should be looking for common patterns that we can push to IndexOf or something like that, to benefit a broad set of scenarios, if possible.

@karelz karelz added this to the 7.0.0 milestone Jan 11, 2022
@dotnet dotnet locked as resolved and limited conversation to collaborators Feb 10, 2022