Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve ExecutionContext fast-paths #36538

Closed
wants to merge 4 commits into from

Conversation

benaadams
Copy link
Member

@benaadams benaadams commented May 15, 2020

Improve some of the ThreadPool fast-paths, reduce indirect calls for AsyncValueTaskMethodBuilder<T>; AsyncTaskMethodBuilder<T> and Timer; by calling .MoveNext() directly call rather via the indirect chain EC.Run -> delegate -> .MoveNext() which should help with the CPU's instruction cache and indirect branch prediction (Intel only tries to predict one indirect call, not chains?).

Split merged methods switched on if into their respective callers:

  • Remove extra call for AsyncValueTaskMethodBuilder<T> and AsyncTaskMethodBuilder<T> for MoveNext() and IThreadPoolWorkItem.Execute() (since they are top level virtual/interface and their callee is too big to inline into them)

Optimize invocations from ThreadPoolWorkQueue.Dispatch:

  • AsyncValueTaskMethodBuilder<T>; AsyncTaskMethodBuilder<T> and Timer can call .MoveNext() directly if Default context and called from ThreadPoolWorkQueue.Dispatch.
  • AsyncValueTaskMethodBuilder<T>; AsyncTaskMethodBuilder<T> and Timer don't need to touch ThreadStatic CurrentThread called from ThreadPoolWorkQueue.Dispatch.
  • AsyncValueTaskMethodBuilder<T>; AsyncTaskMethodBuilder<T> and Timer can use simpler EC run when not-Default context and called from ThreadPoolWorkQueue.Dispatch.

Optimize invocations for Default context:

  • AsyncValueTaskMethodBuilder<T>; AsyncTaskMethodBuilder<T> and Timer can call .MoveNext() directly if Default context and thread currently on Default context.

Json callers (on Default context; which become extra fast-pathed)

image

Fortunes callers (on Default context; which become extra fast-pathed)

image

@benaadams benaadams marked this pull request as ready for review May 15, 2020 20:43
@benaadams
Copy link
Member Author

@stephentoub issues resolved?

@stephentoub
Copy link
Member

@stephentoub issues resolved?

Functionally it looks ok to me. Does it have a measurable perf impact?

@benaadams
Copy link
Member Author

Does it have a measurable perf impact?

Is peanut butter; at least 0.5% of ThreadPool.Dispatch for Json

image

Though it does also fix an existing issue with EC restore for legacyUnhandledExceptionPolicy / host policy; that you highlighted in #36538 (comment)

@benaadams
Copy link
Member Author

Downside is it doesn't effect IOQueue callbacks; which do the full EC Run since they aren't directly run from the ThreadPool; which is also where some of the interface casting callbacks come from (since it isn't cast away on queue to ThreadPool)

image

@stephentoub
Copy link
Member

Is peanut butter; at least 0.5% of ThreadPool.Dispatch for Json

Does that mean we see this change improving json TE throughout by 10K RPS? Can we run the benchmarks with it?

@kouvel
Copy link
Member

kouvel commented May 16, 2020

Will long lived Action variables that are early captures point at their Tier0 methods rather than Tier1 due to when they were captured?

I'm not seeing this comment anymore, but the delegate would call the method through a precode and the precode target would be updated to point to the Tier1 code once it's ready. Currently when tiering is enabled, direct calls also call through a precode. When a method is jitted before a callee is jitted, the call would go through a precode (when tiering is disabled too). If the callee is already Tier1 it may be possible to avoid calling through the precode, I'll have to look into this, the same would apply to delegates as well.

@benaadams
Copy link
Member Author

Just curious... Are the precodes jmp blocks (like a trampoline or tailcall, as it were) or does it end up being a double call?

Currently when tiering is enabled, direct calls also call through a precode. When a method is jitted before a callee is jitted, the call would go through a precode (when tiering is disabled too).

So they always called this way rather direct calls going directly to method; making inlining having an additional effectiveness of eliminating the indirection? (aside from the other advantages of inlining)

@kouvel
Copy link
Member

kouvel commented May 16, 2020

Are the precodes jmp blocks (like a trampoline or tailcall, as it were)

Yea they are jmp blocks

So they always called this way rather direct calls going directly to method; making inlining having an additional effectiveness of eliminating the indirection?

Possibly, I had done a bit of prototyping on this before and didn't see much improvement in steady-state from fixing direct calls but some noticeable startup improvement as there would be more calls involved. May also depend on arch/processor, as prefetching helps to avoid the overhead. Delegate calls may be more expensive though, curious to see.

@benaadams
Copy link
Member Author

Delegate calls may be more expensive though, curious to see.

If I was to go all in; it would be something like this https://github.com/dotnet/runtime/compare/master...benaadams:EC-all-in?expand=1

To turn all the Default context => Default context runs into direct calls rather than the triple: call => delegate => call; so it would also pick up all these in for example the Fortunes benchmark:

image

Since they are chains of async only the first couple are picked up by this change as they run directly on ThreadPool, the follow on inlines don't get picked up.

@benaadams
Copy link
Member Author

Hmm, might be able to do that, but in better way

@benaadams
Copy link
Member Author

Does that mean we see this change improving json TE throughout by 10K RPS? Can we run the benchmarks with it?

@adamsitnik can you help here?

@benaadams benaadams force-pushed the EC-improves branch 2 times, most recently from 03f4148 to 31c37a6 Compare May 17, 2020 19:33
@adamsitnik
Copy link
Member

adamsitnik commented May 18, 2020

can you help here?

Sure! Is there any chance that you could send me a Release copy of your System.Private.CoreLib.dll built for Unix?

@adamsitnik
Copy link
Member

@benaadams I've run them twice (after #1 and after #2)

obraz

You can find the trace files in traces/36538_ExecutionContext

@benaadams
Copy link
Member Author

Get faster; get more contention, go slower? :-/

image

@stephentoub stephentoub added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label May 19, 2020
@stephentoub
Copy link
Member

@benaadams, seems like this should be closed, since for whatever reason we see things getting worse with it?

@benaadams
Copy link
Member Author

seems like this should be closed

Would like to revisit when EH write through is on by default

@stephentoub
Copy link
Member

Sounds good. Let's close it until then. Thanks.

@stephentoub stephentoub closed this Aug 2, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Threading NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants