
Make "async ValueTask/ValueTask<T>" methods amortized allocation-free #26310

Draft · wants to merge 2 commits into base: master

Conversation

@stephentoub (Member) commented Aug 22, 2019

Today async ValueTask/ValueTask<T> methods use builders that special-case the synchronously completing case (to just return a default(ValueTask) or new ValueTask<T>(result)) but defer to the equivalent of async Task/Task<T> for when the operation completes asynchronously. This, however, doesn't take advantage of the fact that value tasks can wrap arbitrary IValueTaskSource/IValueTaskSource<T> implementations.
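The sync-completion special case described here can be sketched roughly as follows. This is a simplified illustration of the existing builder's shape, not the actual CoreLib source; `SketchValueTaskMethodBuilder` is a hypothetical name:

```csharp
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

// Simplified sketch: synchronous completion stores the result inline (no
// allocation), while asynchronous completion falls back to an allocated
// Task<TResult> via the ordinary AsyncTaskMethodBuilder<TResult>.
public struct SketchValueTaskMethodBuilder<TResult>
{
    private AsyncTaskMethodBuilder<TResult> _methodBuilder; // only used on async completion
    private TResult _result;
    private bool _haveResult;
    private bool _useBuilder;

    public void SetResult(TResult result)
    {
        if (_useBuilder)
        {
            _methodBuilder.SetResult(result); // asynchronous path: a Task<TResult> was allocated
        }
        else
        {
            _result = result;                 // synchronous path: no allocation
            _haveResult = true;
        }
    }

    public ValueTask<TResult> Task =>
        _haveResult
            ? new ValueTask<TResult>(_result)              // wraps the value directly
            : new ValueTask<TResult>(_methodBuilder.Task); // wraps the allocated Task<TResult>
}
```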

I've had this sitting on the shelf for a while, but finally cleaned it up. The first three commits here are just moving around existing code. The last two commits are the meat of this change. This changes AsyncValueTaskMethodBuilder and AsyncValueTaskMethodBuilder<T> to use pooled IValueTaskSource/IValueTaskSource<T> instances, such that calls to an async ValueTask/ValueTask<T> method incur 0 allocations as long as there's a cached object available.

I've marked this as a draft and work-in-progress for a few reasons:

  1. There's a breaking change here, in that while we say/document that ValueTask/ValueTask<T>s are more limited in what they can be used for, nothing in the implementation actually stops a ValueTask that was wrapping a Task from being used as permissively as Task, e.g. if you were to await such a ValueTask multiple times, it would happen to work, even though we say "never do that". This change means that anyone who was relying on such undocumented behaviors will now be broken. I think this is a reasonable thing to do in a major release, but I also want feedback and a lot of runway on it. There's the potential to make it opt-in (or opt-out) as well, but that will also non-trivially complicate the implementation.
  2. Policy around pooling. This is always a difficult thing to tune. Right now I've chosen a policy that limits the number of pooled objects per state machine type to an arbitrary multiple of the processor count, and that tries to strike a balance between contention and garbage by using a spin lock and if there's any contention while trying to get or return a pooled object, the cache is ignored. We will need to think hard about what policy to use here. It's also possible it could be tuned per state machine, e.g. by having an attribute that's looked up via reflection when creating the cache for a state machine, but that'll add a lot of first-access overhead to any async ValueTask/ValueTask<T> method.
  3. Perf validation.
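The breaking change in point 1 is the double-await pattern. A minimal sketch of the usage that happens to work today but becomes invalid once the backing box can be pooled:

```csharp
using System.Threading.Tasks;

class DoubleAwaitSketch
{
    // Any async ValueTask<T> method that completes asynchronously.
    static async ValueTask<int> ComputeAsync()
    {
        await Task.Yield();
        return 42;
    }

    static async Task Main()
    {
        ValueTask<int> vt = ComputeAsync();
        int first = await vt;  // fine: consumes the ValueTask

        // With today's Task-backed implementation this second await happens to
        // work; with a pooled IValueTaskSource<T> backing, the box may already
        // have been returned to the pool, so this can throw or observe the
        // state of an unrelated operation. "Never do that."
        int second = await vt;
    }
}
```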

cc: @kouvel, @tarekgh, @benaadams

@jkotas (Member) commented Aug 22, 2019

What do we expect the performance gains from this to be - for primary perf metrics like requests per second for something real looking? (I understand that this will reduce the number of GCs by moving the cost elsewhere.)

@stephentoub (Member, Author) commented Aug 22, 2019

What do we expect the performance gains from this to be - for primary perf metrics like requests per second for something real looking?

Don't know yet. I'm hoping by putting this up as a draft folks will try it out :) I'm also going to try it on a few workloads. I won't be merging it until it proves valuable.

@benaadams (Collaborator) commented Aug 22, 2019

Will give it a whirl :)

@benaadams (Collaborator) commented Aug 22, 2019

Isn't very happy

Unhandled exception. System.InvalidOperationException: Operation is not valid due to the current state of the object.
   at System.Threading.Tasks.Sources.ManualResetValueTaskSourceCore`1.OnCompleted(Action`1 continuation, Object state, Int16 token, ValueTaskSourceOnCompletedFlags flags) in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Threading\Tasks\Sources\ManualResetValueTaskSourceCore.cs:line 163
   at System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1.ValueTaskStateMachineBox.OnCompleted(Action`1 continuation, Object state, Int16 token, ValueTaskSourceOnCompletedFlags flags) in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Runtime\CompilerServices\AsyncValueTaskMethodBuilderT.cs:line 309
   at System.Runtime.CompilerServices.ValueTaskAwaiter`1.System.Runtime.CompilerServices.IStateMachineBoxAwareAwaiter.AwaitUnsafeOnCompleted(IAsyncStateMachineBox box) in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Runtime\CompilerServices\ValueTaskAwaiter.cs:line 188
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AwaitUnsafeOnCompleted[TAwaiter,TStateMachine](TAwaiter& awaiter, TStateMachine& stateMachine) in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Runtime\CompilerServices\AsyncTaskMethodBuilderT.cs:line 88
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__139_1(Object state) in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Threading\Tasks\Task.cs:line 1910
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute() in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Threading\ThreadPool.cs:line 880
   at System.Threading.ThreadPoolWorkQueue.Dispatch() in C:\GitHub\coreclr\src\System.Private.CoreLib\shared\System\Threading\ThreadPool.cs:line 700
@stephentoub (Member, Author) commented Aug 22, 2019

Isn't very happy

What is this running?

@benaadams (Collaborator) commented Aug 22, 2019

Isn't very happy

What is this running?

Just trying to debug into it as the stack trace clearly isn't very helpful :)

@halter73 (Member) commented Aug 22, 2019

e.g. if you were to await such a ValueTask multiple times, it would happen to work, even though we say "never do that". This change means that anyone who was relying on such undocumented behaviors will now be broken.

I say it's better to make this change sooner rather than later. I like that this change might teach people the hard way not to await the same ValueTask multiple times, since it will no longer be just Pipes and other IValueTaskSource-based ValueTasks that break.

Do you expect there will be any common workloads where pooling the boxes hurts perf?

@stephentoub (Member, Author) commented Aug 22, 2019

Do you expect there will be any common workloads where pooling the boxes hurts perf?

It's hard to say.

Using pooling is based on the assumption that the overheads associated with pooling (e.g. synchronization to access the pool, any cache effects that come from using the pool, etc.) are less than the overheads associated with allocation+GC. That is not always the case, but it's also hard to predict.

There's also the fact that this is changing the underlying implementation, so it's not just pooling vs non-pooling, but ValueTask wrapping Task vs wrapping IValueTaskSource, different code paths being taken internally, different synchronization used to hook up a continuation, etc. Some of that is just tuning, but some of it will be trade-offs as well. In theory this approach could have less overhead even if pooling weren't used at all (just using a custom IValueTaskSource) because it doesn't need to accommodate things like multiple continuations, but we're also talking minutia at that point.

In the end, I'll want us to vet this on real workloads that demonstrate real value before moving ahead with it.
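The pooling policy being weighed here (a per-state-machine cache capped at a multiple of the processor count, guarded by a spin lock that is simply bypassed on any contention) can be sketched like this. The type and member names are illustrative assumptions, not the PR's actual code:

```csharp
using System;
using System.Threading;

// Sketch of a contention-averse box cache: Rent/Return never wait on the lock.
// If TryEnter fails, the cache is skipped (allocate on rent, drop on return),
// trading a little extra garbage for zero lock convoying.
internal sealed class StateMachineBoxCacheSketch<TBox> where TBox : class, new()
{
    private readonly TBox?[] _boxes = new TBox?[Environment.ProcessorCount * 4]; // arbitrary multiple
    private int _count;
    private SpinLock _lock = new SpinLock(enableThreadOwnerTracking: false);

    public TBox Rent()
    {
        TBox? box = null;
        bool taken = false;
        _lock.TryEnter(ref taken);          // no spinning: contention => allocate instead
        if (taken)
        {
            if (_count > 0)
                box = _boxes[--_count];
            _lock.Exit(useMemoryBarrier: false);
        }
        return box ?? new TBox();
    }

    public void Return(TBox box)
    {
        bool taken = false;
        _lock.TryEnter(ref taken);          // contention => let the GC have the box
        if (taken)
        {
            if (_count < _boxes.Length)
                _boxes[_count++] = box;
            _lock.Exit(useMemoryBarrier: false);
        }
    }
}
```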

@mburbea commented Aug 22, 2019

@mgravell has a pretty interesting library that does something very similar to this
https://github.com/mgravell/PooledAwait

@benaadams (Collaborator) commented Aug 22, 2019

Revisiting #21612 again 😉

Since ValueTaskStateMachineBox is a subtype would just calling it StateMachineBox work (as is done for many Enumerators)?

So rather than
AsyncValueTaskMethodBuilder`1.ValueTaskStateMachineBox.OnCompleted
it is
AsyncValueTaskMethodBuilder`1.StateMachineBox.OnCompleted

However that's not too important 😄

@benaadams (Collaborator) commented Aug 22, 2019

The issue I'm having is, I think, in websockets; it's specifically commented on here: https://github.com/dotnet/corefx/blob/ba9f27788bd99a29e5d832531f0ee86d8eae4959/src/System.Net.WebSockets/src/System/Net/WebSockets/ManagedWebSocket.netcoreapp.cs#L30-L31

// WARNING: This code is only valid because ReceiveAsyncPrivate returns a ValueTask that wraps a T or a Task.
// If that ever changes where ReceiveAsyncPrivate could wrap an IValueTaskSource, this must also change.

None of the other ValueTasks seem to be having issues
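A hedged sketch of the kind of assumption that warning comment is about (illustrative only, not the actual corefx code; `ReceiveAsyncPrivateSketch` is a made-up stand-in): consuming a ValueTask as if it could only ever wrap a `T` or a `Task`, which stops being safe once it can wrap a pooled IValueTaskSource:

```csharp
using System.Threading.Tasks;

class WrappingAssumptionSketch
{
    // Stand-in for a method like ReceiveAsyncPrivate.
    static async ValueTask<int> ReceiveAsyncPrivateSketch()
    {
        await Task.Yield();
        return 16;
    }

    static async Task Main()
    {
        ValueTask<int> receiveTask = ReceiveAsyncPrivateSketch();

        if (receiveTask.IsCompletedSuccessfully)
        {
            int bytes = receiveTask.Result; // valid for any backing store, exactly once
        }
        else
        {
            // AsTask() consumes the ValueTask. With a Task-backed ValueTask it
            // happens to be harmless to keep touching receiveTask afterwards;
            // with an IValueTaskSource-backed one, any further use of
            // receiveTask (another await, another AsTask call) is invalid.
            Task<int> t = receiveTask.AsTask();
            int bytes = await t;
        }
    }
}
```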

@stephentoub (Member, Author) commented Aug 22, 2019

issue I'm having I think is in websockets

Yup, I wrote both that code and comment :) Shame on me. Will change it.

@mgravell commented Aug 22, 2019

@mburbea I would be genuinely delighted if I could nuke that lib in entirety because the feature has been implemented in the core lib by folks who know better than me what they're doing.

@benaadams (Collaborator) commented Aug 22, 2019

256 websocket connections ~30sec

Pre: [screenshot]

Post: [screenshot]

Might still want to do "Amortize WebSocket.EnsureBufferContainsAsync" dotnet/corefx#39455; not sure why it's not really affected by this change. However, it does address the harder-to-amortize async methods.

Will do some benchmarks

@stephentoub (Member, Author) commented Aug 22, 2019

not sure why it's not really affected by this change

Oh, simpler explanation:

private async Task EnsureBufferContainsAsync

;-)

@benaadams (Collaborator) commented Aug 22, 2019

Yep, changing it entirely fixes it:

[screenshot]

@benaadams (Collaborator) commented Aug 22, 2019

I'm liking this very much. Even with its low(ish) pool limit the allocation growth is low

200 websocket connections ~30sec

[screenshot]

1000 websocket connections ~30sec

[screenshot]

And the cost for renting and returning boxes seems fairly minimal

@stephentoub (Member, Author) commented Aug 22, 2019

I'm liking this very much

Thanks. Does it have a measurable impact on your throughput?

@benaadams (Collaborator) commented Aug 22, 2019

The change in before and after allocations is huge 😄

1000 websocket connections ~30sec

Pre: 1.36M objects allocated (245MB)

[screenshot]

Post: 209K objects allocated (20MB)

[screenshot]

@benaadams (Collaborator) commented Aug 23, 2019

Does it have a measurable impact on your throughput?

Yes(ish); unfortunately (or fortunately), it moves things into a realm beyond what I can accurately load test with my current setup, as I can't generate enough load. It's also highlighted some other hot spots in our own code.

While I can't currently quantify that (though I will look into it more), more interestingly it also means that, for the same number of GCs and RAM usage, I can pack the containers at least 4x tighter when they aren't under maximum load.

@benaadams (Collaborator) commented Aug 23, 2019

The overhead of GetOrCapture is less than capturing the ExecutionContext

[screenshot]

Well, some are more than others, but it's a fairly insignificant cost:

[screenshot]

ReturnBox is a little more, but fairly small:

[screenshot]

@benaadams (Collaborator) commented Aug 23, 2019

@mgravell are you in a position to give this a try for your workloads? Note it only applies to async ValueTask and ValueTask<T> methods (not async Task).

@benaadams (Collaborator) commented (comment minimized)

@stephentoub (Member, Author) commented Aug 24, 2019

3.1?

Let's focus on determining whether it should go in at all. Then if it does we can determine if it should ship earlier than master otherwise would.

You seem motivated. 😉 Do you have a real workload you could try this with and report back on your perf (throughput, scale, memory, etc.)?

@davidfowl (Contributor) commented Aug 25, 2019

I might have to spike ASP.NET Core with ValueTask return for everything 😄

@stephentoub (Member, Author) commented Aug 26, 2019

I might have to spike ASP.NET Core with ValueTask return for everything

@sebastienros, I'd like to get a comprehensive look at if/how this would affect ASP.NET performance. That would mean using bits produced by this PR, along with corefx bits from dotnet/corefx#40527, and an AspNetCore PR (that presumably @davidfowl or I or @benaadams could create) that would switch internal/private Task-returning methods to ValueTask-returning. What's the best way to go about that?

@davidfowl (Contributor) commented Aug 26, 2019

and an AspNetCore PR (that presumably @davidfowl or I or @benaadams could create) that would switch internal/private Task-returning methods to ValueTask-returning. What's the best way to go about that?

I'm looking at that (started last night)

@stephentoub stephentoub force-pushed the stephentoub:asyncvaluetaskallocs branch from 3648e79 to 5856ab3 Aug 26, 2019

@stephentoub stephentoub changed the title [WIP] Make "async ValueTask/ValueTask<T>" methods amortized allocation-free Make "async ValueTask/ValueTask<T>" methods amortized allocation-free Aug 26, 2019

@stephentoub stephentoub added this to the 5.0 milestone Aug 26, 2019

@adamsitnik (Member) commented Aug 28, 2019

What's the best way to go about that?

afaik you would need to pass the modified libraries via --output-file to BenchmarksDriver

I've recently created a template app and a doc that describes how to run a simple benchmark against modified CoreFX stuff: https://github.com/aspnet/Benchmarks/tree/master/src/BenchmarksApps/Samples/Template

@mgravell commented Aug 28, 2019

Just to note a thought someone raised in discussion of this point: the ValueTask docs currently basically say "prefer Task"

As such, the default choice for any asynchronous method should be to return a Task or Task<TResult>. Only if performance analysis proves it worthwhile should a ValueTask<TResult> be used instead of a Task<TResult>.

If this change lands, it may be necessary/desirable to update that guidance. Fortunately, the docs do also say:

The following operations should never be performed on a ValueTask<TResult> instance:

  • Awaiting the instance multiple times.
    ...

:)

@stephentoub (Member, Author) commented Aug 28, 2019

If this change lands, it may be necessary/desirable to update that guidance.

Why?

@mgravell commented Aug 28, 2019

why?

Surely it significantly changes the balance of when the overheads of Task and ValueTask "bite"? meaning: in most await cases, whether synchronously-complete or full async, there would be a good reason to suspect that the ValueTask version would be at least as efficient, if not more efficient? The "measure it and find out" remains good guidance of course; I'm just suspicious as to whether it makes sense for the default choice to be Task[<T>] in the context of the above.

@stephentoub (Member, Author) commented Aug 28, 2019

Usability and correctness are the primary drivers here, and in the majority of situations, an extra allocation is not going to be impactful.

@davidfowl (Contributor) commented Aug 28, 2019

Usability and correctness are the primary drivers here, and in the majority of situations, an extra allocation is not going to be impactful.

It can be when you multiply it by number of requests and number of state machines created in that request.

@stephentoub (Member, Author) commented Aug 28, 2019

It can be

Sure, "in the majority of situations" is not "all situations" :)

@davidfowl (Contributor) commented Aug 28, 2019

In all ASP.NET applications 😄

@mgravell commented Aug 28, 2019

Sure, "in the majority of situations" is not "all situations" :)

I guess I'm coming at this from the angle of a library author dealing in relatively high-throughput code paths, so from my perspective ValueTask optimisations are likely to be worthwhile. Yes, there are lots of scenarios, especially in app code and especially nearer the top of the call stack, where it really isn't going to matter, and the most obviously correct tool is probably the right default. We should make t-shirts: "Library authors: writing ugly and bizarre code for you, so you can pretend the codebase is clean"

@stephentoub (Member, Author) commented Aug 28, 2019

a library author dealing in relatively high throughput code paths

Yup, then you fall into the category of:
https://devblogs.microsoft.com/dotnet/understanding-the-whys-whats-and-whens-of-valuetask/
"However, ValueTask / ValueTask<TResult> are great choices when a) you expect consumers of your API to only await them directly, b) allocation-related overhead is important to avoid for your API, and c) either you expect synchronous completion to be a very common case, or you’re able to effectively pool objects for use with asynchronous completion."

@mgravell commented Aug 28, 2019

I feel like that just adds weight to the argument that the docs.microsoft.com documentation is incomplete... since the official docs guidance doesn't make those points. But: I'm happy to just nod and stop being a distraction. Getting the awesomeness properly evaluated is more important to me than the docs :)
