Improved HTTP/1 header parser #63295
Conversation
Tagging subscribers to this area: @dotnet/ncl

Issue Details

This replaces the SocketsHttpHandler header parser with two implementations: AVX2, and portable. This ports over the work done as part of the LLHTTP project. This gives a 5-10% end-to-end perf improvement over localhost:

The AVX2 parser has a 50-100% performance improvement over the portable implementation. It is faster in every situation (few headers, lots of headers, small headers, large headers).
```csharp
/// This is done via <see cref="SimdPaddedArray"/>.
/// It also assumes that it can always step backwards to a 32-byte aligned address.
/// </remarks>
private unsafe (bool finished, int bytesConsumed) ParseHeadersAvx2(Span<byte> buffer, HttpResponseMessage? response, bool isFromTrailer)
```
I think this should be changed to be Vector128<T>-based, possibly even waiting for #61649 to be merged and using the new xplat HWIntrinsic APIs.

We shouldn't be merging acceleration here, or anywhere else in the BCL, without also including ARM64 support, and that means using Vector128<T> as a first-class path. There may be additional optimization opportunities involving Vector256<T> (and therefore AVX2, or eventually SVE on ARM); but we always need a Vector128<T> path on these APIs.

CC. @danmoseley
Thanks for this work @scalablecory! I agree with @tannergooding that we should nowadays consider ARM64 support as essential, but doing it explicitly would be wasted work. @tannergooding, how soon can Vector128<T> be usable?
(It'd also be really helpful if, as part of making it fully usable, we aggressively fixed up all current usage of intrinsics in the repo to use Vector128. I believe that's part of the plan, just making sure. Then subsequent usage can use that as a guide.)
> we aggressively fixed up all current usage of intrinsics in the repo to use Vector128. I believe that's part of the plan, just making sure.

Right, that is the current plan. Vector128<T> is the "common" surface area across all currently supported platforms (x86, x64, ARM64) and will likewise be supported on future platforms if we add them (WASM, ARM32, PowerPC, etc.). A large reason for this is that Vector128<float> is the natural type for graphical and multimedia applications, so almost all CPUs add some sort of SIMD support for 128-bit vectors.

There may still be cases where having an additional path supporting Vector256<T> is beneficial, particularly if the workloads are known to be big. But given that ARM32/ARM64 doesn't support Vector256<T> (we could only emulate it as 2x Vector128<T> ops), Vector128<T> is likely the base type that all SIMD algorithms need to start with.

This also ensures platforms without AVX/AVX2 support are accelerated: x64 emulation on ARM64, for example, only supports up to SSE4.1 on both Windows and macOS. It also ensures low-power or budget CPUs (such as Intel Atom) are supported.
> But doing it explicitly would be wasted work. @tannergooding how soon can Vector128 be usable?

#61649 is the last "big" stepping stone and ensures Vector128<T>.IsHardwareAccelerated returns true. It's just pending review from @echesakovMSFT (whom I pinged this morning; they indicated they would try to get to it this week).

There are still a couple of additional PRs to go up after this, but they mainly add nint/nuint support and a couple of additional APIs that were approved after the initial API review for the xplat APIs, such as the Sum method.
> We shouldn't be merging acceleration here, or anywhere else in the BCL without also including ARM64 support and that means using Vector128<T> as a first class path.

Can you expand here? I can read this three ways:
- We have new cross-plat vector stuff that we think will be applicable here, but it isn't complete, and maybe we can use this as a test case.
- We have red tape saying no platform-specific intrinsics without also having an ARM64 path. This seems like we'd be letting perfect get in the way of good -- I'm happy to revisit an ARM64 version in a separate issue, but delaying this change doesn't seem to add any value?
- We intend to do away with platform-specific intrinsics in libraries, so we shouldn't start adding more.
> We have red tape saying no platform-specific intrinsics without also having an ARM64 path.

I think this is the most applicable. ARM64, particularly for .NET 7, is a major top-down focus. It is a first-class platform and we should not be adding new code that accelerates x64 and skips out on accelerating ARM64.

> We have new cross-plat vector stuff that we think will be applicable here, but it isn't complete,

The new cross-platform vector APIs will be applicable here. They're practically ready, just pending the last PR getting merged. The entire purpose of them is to help trivialize supporting ARM64 (and other non-x64 platforms), because we have a goal of ensuring they are also treated as "first class".

In most of the cases where we have SIMD code supporting both x64 and ARM64, the two code paths are nearly identical, often differing only in a couple of places, if anywhere at all. The new APIs allow you to write things like x + y rather than AdvSimd.Add(x, y) and Sse.Add(x, y), which lets most of the code be shared.

There will still be a handful of places where there is no common functionality available, or where it is advantageous to specialize on a given platform. For those scenarios, you trivially switch from xplat intrinsics to hardware-specific intrinsics using if (AdvSimd.IsSupported) { /* ARM64 path */ } else if (Sse.IsSupported) { /* x86/x64 path */ } else { /* fallback path */ }. Part of the reason this is "trivial" is that you are already using Vector128<T>, so you have: shared simd logic; plat-specific simd logic; shared simd logic.
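Expanded into compilable form, that pattern looks roughly like the following sketch (assumes the .NET 7 xplat Vector128 operators; byte addition stands in for real per-platform work, and the type/method names here are illustrative):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

static class VectorAdd
{
    // Shared xplat path: the operator lowers to AdvSimd.Add on ARM64 and
    // to the SSE equivalent on x86/x64, so one body covers all platforms.
    public static Vector128<byte> Shared(Vector128<byte> x, Vector128<byte> y)
        => x + y;

    // Hardware-specific specialization, only needed where a platform offers
    // something the shared surface does not express.
    public static Vector128<byte> Specialized(Vector128<byte> x, Vector128<byte> y)
    {
        if (AdvSimd.IsSupported)
        {
            return AdvSimd.Add(x, y);     // ARM64 path
        }
        else if (Sse2.IsSupported)
        {
            return Sse2.Add(x, y);        // x86/x64 path
        }
        else
        {
            return x + y;                 // software fallback
        }
    }
}
```

Both methods compute the same result; the specialized form exists only to show where a platform-specific branch would slot in.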
> We intend to do away with platform-specific intrinsics in libraries so we shouldn't start adding more.

It won't go away, and there will still be places where platform-specific optimization opportunities are available. But by default, most code adding SIMD support will have a Vector128<T> shared path that works for ARM64 and for small inputs on x86/x64. This same path will also allow implicit light-up on future platforms getting SIMD support, such as WASM.
Thank you for the clarification.

> There will still be a handful of places where there is no common functionality available or where it is advantageous to specialize on a given platform. For those scenarios, you trivially switch from xplat intrinsics to hardware-specific intrinsics using if (AdvSimd.IsSupported) { /* ARM64 path */ } else if (Sse.IsSupported) { /* x86/x64 path */ } else { /* fallback path */ }. Part of the reason this is "trivial" is that you are already using Vector128<T>, so you have: shared simd logic; plat-specific simd logic; shared simd logic.

This seems very reasonable.
> We have red tape saying no platform-specific intrinsics without also having an ARM64 path.

> I think this is the most applicable. ARM64, particularly for .NET 7, is a major top-down focus. It is a first-class platform and we should not be adding new code that accelerates x64 and skips out on accelerating ARM64.

Thanks, this makes sense. I agree fully with the goal, but we need to work on how we word this guidance (for docs, and future PR feedback). We shouldn't state it as a dogma of "no Vector256", but rather something like: "here's how we want cross-plat vector code to work (see your great fallback example); PRs should prefer this where possible, and justify it if not."

Re the SimdPaddedArray stuff: is there a standard/recommended way to do this? It seems like this would be a common problem for any SIMD user.
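For context on the padding question, the general trick is to over-allocate and zero-fill the backing array so that full-width vector loads overlapping the logical bounds stay in-bounds and match nothing. A hypothetical sketch (illustrative names and layout only, not this PR's actual SimdPaddedArray):

```csharp
using System;

// Illustrative only: a buffer zero-padded on both sides so that 32-byte
// loads which step backwards before the data, or run past its end, stay
// in bounds and cannot produce false token matches (pad bytes are zero).
sealed class PaddedBuffer
{
    private const int Pad = 32; // Vector256<byte> width in bytes

    private readonly byte[] _storage;
    public int Length { get; }

    public PaddedBuffer(ReadOnlySpan<byte> data)
    {
        Length = data.Length;
        _storage = new byte[Pad + data.Length + Pad]; // zeroed by the runtime
        data.CopyTo(_storage.AsSpan(Pad));
    }

    // The logical contents, with zero padding guaranteed on either side.
    public ReadOnlySpan<byte> Data => _storage.AsSpan(Pad, Length);
}
```

The cost is one copy and a slightly larger allocation; the benefit is that the hot loop needs no per-iteration bounds checks for partial vectors at either end.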
The portable version uses span.IndexOf and span.IndexOfAny. My understanding is that these are already vector-optimized. Is that not happening here for some reason? Or is the hand-rolled vectorization more efficient for some reason, e.g. are we amortizing some costs across operations or something like that?
Those are vectorized, but they throw away a lot of data that is useful for optimizing HTTP/1. In our case, 32 bytes will likely contain 2-3 headers, and this maintains (in a bitmask) all the interesting tokens to check. There is also a perf improvement from removing bounds checks and hand-optimizing for good codegen, but this was secondary.
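To make the bitmask point concrete: one 32-byte compare-and-movemask records the position of every ':' and '\n' in the chunk at once, and those positions are then consumed with trailing-zero counts instead of re-scanning the bytes. A portable sketch of just the consumption step (the mask computation itself is what the AVX2 snippet later in the thread does):

```csharp
using System.Collections.Generic;
using System.Numerics;

static class TokenMask
{
    // Given a mask where bit i is set when chunk[i] was an interesting byte
    // (':' or '\n'), yield each position without touching the bytes again.
    public static IEnumerable<int> Positions(uint foundMask)
    {
        while (foundMask != 0)
        {
            yield return BitOperations.TrailingZeroCount(foundMask);
            foundMask &= foundMask - 1; // clear the lowest set bit
        }
    }
}
```

With 2-3 headers per 32-byte chunk, one mask covers several header boundaries, which is the data IndexOf/IndexOfAny compute internally but discard after returning the first match.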
I wonder if it would be better to have "intro" and "outro" code to handle misaligned prefix/postfix data. This seems like it has a couple of potential benefits. Thoughts?
```csharp
vector = Avx.LoadAlignedVector256(vectorIter);
foundLF = (uint)Avx2.MoveMask(Avx2.CompareEqual(vector, maskLF));
foundCol = foundLF | (uint)Avx2.MoveMask(Avx2.CompareEqual(vector, maskCol));
```
It seems like this is unnecessary work in a fair number of cases... i.e. when the header value extends beyond this vector, which seems pretty common. Would it make sense to defer this until we actually find the \n indicating the end of the header value?
The intro/outro approach is a no-go -- I had a version that did this and it was much slower. I also had a version that used a load mask, which was also slower.

In theory the code should never match past the buffer, because that area will be 0-initialized; let me check that this is the case. I can't remember why I put the check for that back in. We could use unaligned loads, or pad the front of the buffer with 0s too, to potentially avoid some checks.

What I found was that matches in the "pre-buffer" area were rare enough that I wasn't sure the tradeoff of a mostly perfectly predicted branch vs. the setup to remove the branch was worth it. I changed the code a bit since I tried this, though, so maybe it's worth another look.
Can you elaborate on this? Specifically:
Based on prior experience working with a wide variety of web APIs. Websites certainly tend to have much larger headers, which is why I benchmarked against that as well.
Note: the portable version does

```csharp
colIdx = x.IndexOfAny(':', '\n');
lfIdx = x.IndexOf('\n');
```

Three headers will mean six calls to IndexOf{Any}.
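As a concrete illustration of the call-count claim, the shape being described is roughly the following (a hedged sketch, not the PR's actual parser; it only counts scans rather than parsing values):

```csharp
using System;

static class PortableScanCount
{
    // Mirrors the portable pattern under discussion: two span scans per
    // header line, so parsing three headers issues six IndexOf/IndexOfAny
    // calls, each paying its own setup cost.
    public static int CountScans(ReadOnlySpan<byte> buffer)
    {
        int scans = 0;
        while (!buffer.IsEmpty)
        {
            int colIdx = buffer.IndexOfAny((byte)':', (byte)'\n'); scans++;
            int lfIdx = buffer.IndexOf((byte)'\n'); scans++;
            if (lfIdx < 0) break;          // incomplete line: wait for more data
            _ = colIdx;                    // a real parser would slice the name here
            buffer = buffer.Slice(lfIdx + 1);
        }
        return scans;
    }
}
```

The bitmask approach replaces those repeated scans with one masked pass per 32-byte chunk, which is where the claimed per-header savings come from.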
Even the minimalist default weather-forecast microservice ASP.NET template has an average header length of 27. I'm not discounting prior experience, but if we're going to base decisions on this, it'd be helpful to see the concrete data impacting it.
Treat my anecdote as a scenario to show where
I'm not distracted ;-) I'm trying to better understand the root of a cited ~35% improvement (in the case of the last line of the table). The example claimed the bulk of that improvement came from an ~6:1 reduction in calls to IndexOf{Any}. It sounds instead like in the typical case it might really be more like 2:1 -- essentially that we're saving the duplicated setup costs of IndexOfAny+IndexOf by effectively merging them together. But there are other changes in the PR as well, so it's not clear to me how much is coming from that versus from other changes.

And if it really is all coming from the vectorization, that concerns me as well: there's a lot of complexity being added here, and intrinsics/vectorized code is orders of magnitude harder to maintain. While System.Net.Http is a little special, it's still representative of fairly typical library code, and I don't want the majority of such devs to need to replace simple IndexOfAny+IndexOf code with hand-tuned vectorized code in order to gain an additional 35%.

Which then leads me to question: if that really is the source of the gain, how can we do better? Can we push any improvements down into IndexOfAny/IndexOf? Can we add a new variant of these that can be used for key/value pair parsing and encapsulate this common pattern, for SocketsHttpHandler to use but also for anyone else doing such key/value pair parsing (e.g.
I'm going to close this PR and split it up. AVX is not the bulk of the improvement here, so the portable change will be a no-brainer. I'll also open an issue detailing the AVX parser, to maybe spark some ideas on how we can make parsing better in .NET.
Great. Thanks.
Given this, we should definitely split this up. Let's harvest the non-vector improvements first, and then we can focus on if/how vectorization improves things. What are the main sources of improvement here? I'm all for exploring using vector ops better in parsing -- it seems entirely possible that there could be some wins here. But I'll echo a couple of points from @stephentoub above: