
[Perf] Widespread regressions after threadpool change #44211

Closed
DrewScoggins opened this issue Nov 3, 2020 · 16 comments · Fixed by #45901
Labels
arch-x64 arch-x86 area-System.Threading os-linux Linux OS (any supported distro) os-windows tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark

Comments

@DrewScoggins
Member

DrewScoggins commented Nov 3, 2020

After #43841, we are seeing widespread regressions (about 40 tests). Some of these are threading tests, but some seem to have little to do with threading, like DrewScoggins/performance-2#3177. Below I have links to queries covering all of the regressions, and a few improvements, that are linked to this change. We are seeing this across Windows x64, Windows x86, and Ubuntu x64, and I will be updating this issue as we get the Windows x86 and Ubuntu x64 regressions categorized.

Regression Query - Windows x64

Improvement Query - Windows x64

@DrewScoggins DrewScoggins added arch-x86 os-linux Linux OS (any supported distro) os-windows tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark arch-x64 labels Nov 3, 2020
@Dotnet-GitSync-Bot
Collaborator

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Nov 3, 2020
@DrewScoggins DrewScoggins changed the title [Perf] Widespread regressions from threadpool change [Perf] Widespread regressions after threadpool change Nov 3, 2020
@kunalspathak
Member

cc: @kouvel, @mangod9

@mangod9 mangod9 added area-System.Threading and removed untriaged New issue has not been triaged by the area owner labels Nov 3, 2020
@mangod9 mangod9 added this to the 6.0.0 milestone Nov 3, 2020
@mangod9 mangod9 added this to In progress + Investigations in Core-Runtime .net 9 Nov 3, 2020
@mangod9
Member

mangod9 commented Nov 3, 2020

Kount will take a look. Could this require updating the baselines, since a startup hit might be causing some increase in per-iteration times?

@kunalspathak
Member

> Kount will take a look. Could this require updating the baselines, since a startup hit might be causing some increase in per-iteration times?

@adamsitnik

@AndyAyersMS
Member

BDN tests should generally be pretty well-insulated from startup effects. We also considered that the change might impact BDN itself, but that seems unlikely.

As Drew noted, a number of the regressions are in tests that don't seem to have anything to do with threading, which is quite puzzling. We also see increases in instructions retired, so these regressions don't seem to be code-alignment artifacts. They could be data alignment, but the regressions have been pretty stable, so it seems there must have been some path-length changes in code that isn't exclusively threading related.

@kouvel
Member

kouvel commented Nov 6, 2020

Part of the problem appears to be slower stack walking of the thread pool thread when performing a GC. Even though most of these tests don't use the thread pool, tiered compilation currently uses the thread pool to do its background work, and a thread pool thread waiting for work would have managed frames with GC refs. I could switch tiering to not use the thread pool; I was thinking about that anyway, as it would give more control over how much CPU time to use in a stretch for rejits, using just sleeps and without adding overhead from the thread pool queues and dispatching logic.
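
To illustrate the kind of test involved (a hypothetical example, not one of the actual regressed benchmarks), a GC-heavy microbenchmark with no explicit threading could still be affected, because with the portable thread pool enabled tiering's background work keeps a managed thread pool worker alive whose frames have to be walked on every GC:

```csharp
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;

// Hypothetical, illustrative benchmark: allocation-heavy and single-threaded.
// With the portable thread pool enabled, tiered compilation queues its rejit work to
// the thread pool, so a managed worker thread with GC refs on its frames exists during
// the run, making each of the frequent GCs here slightly more expensive to stack-walk.
public class AllocationHeavyBenchmark
{
    [Benchmark]
    public List<byte[]> AllocateMany()
    {
        var list = new List<byte[]>(1000);
        for (int i = 0; i < 1000; i++)
            list.Add(new byte[128]); // many short-lived allocations => frequent GCs
        return list;
    }
}
```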

@mangod9
Member

mangod9 commented Nov 19, 2020

Would it be possible to validate the theory that stack walking on the thread pool thread is causing the regression by disabling the portable thread pool via config?
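
For reference, one way to do that (assuming the standard .NET 5 name for the portable thread pool switch, which is worth double-checking against the build being tested) is the `System.Threading.ThreadPool.UsePortableThreadPool` runtime config property, e.g. in runtimeconfig.json:

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.Threading.ThreadPool.UsePortableThreadPool": false
    }
  }
}
```

The equivalent environment variable form (COMPlus_ThreadPool_UsePortableThreadPool=0, if I have the name right) should also work for a quick A/B run without rebuilding the benchmark.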

@kouvel
Member

kouvel commented Nov 19, 2020

Yes, it's been verified that way in a few cases, and it looks like that would be the cause for most if not all of the tests. I just haven't verified it on all of them yet; that's pending.

@kouvel
Member

kouvel commented Nov 19, 2020

Although verifying it that way would not tell whether it's the same issue for all of the tests; that would need profiles for each test.

@mangod9
Member

mangod9 commented Nov 19, 2020

Yeah, validating against a few tests is fine; not all are required. We can create a tracking item to switch tiering to not use the TP. In the interim, is there a way to rebase the tests to the new "baseline" -- @DrewScoggins?

@DrewScoggins
Member Author

I am not sure exactly what you mean. We can take the PR, run it through the perf lab, and generate a report to see if we are seeing improvements across the tests that regressed before. Or we can just test it locally on a few tests, check it in, and then generate reports from the lab to make sure we got everything back.

@mangod9
Member

mangod9 commented Nov 19, 2020

I meant that the fix will probably take some time to implement, so I'm wondering if we can rebase the perf baseline so that it catches any further regressions that might happen on top of this.

@DrewScoggins
Member Author

Oh, I understand now. The system already does this, so if we see any further regressions in these tests we would flag them as regressions and report them.

@sebastienros
Member

Just to add some data: most aspnet benchmarks are also down on Windows with this change, and some on Linux as well, but not as much.

[screenshot: aspnet benchmark results]

@kouvel
Member

kouvel commented Dec 7, 2020

Filed a separate issue for the ASP.NET TE Windows perf issues, #45716. I don't think it's the same cause as this one.

@kouvel
Member

kouvel commented Dec 10, 2020

The following seem to yield different results in different runs even in the same config mode (though one result is more common than the other). They may depend on JIT timing, or on other known issues like alignment of methods or loops; they're not related to #43841. JIT timing can be fairly stable, and after other changes it can shift to different timing that is still fairly stable. I'm not considering them as regressions for now.

I'm keeping this issue for the slower stack walking mentioned above, and filed a separate issue (#45885) for a few regressions that aren't caused by the slower stack walking.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Dec 10, 2020
kouvel added a commit to kouvel/runtime that referenced this issue Jan 19, 2021
kouvel added a commit that referenced this issue Jan 26, 2021
Use a separate thread for tiered compilation background work

- Makes it easier to manage how much time is spent performing background work like rejitting, and allows yielding more frequently with just Sleep without incurring thread pool overhead, which is useful in CPU-limited cases
- A min/max range is determined for how long background work will be done before yielding the thread (see the sketch after this list). The max is the same as before, 50 ms. For now the min is `processor count` ms (capped to the max), such that in CPU-limited cases the thread yields more frequently in order not to monopolize too much of the limited CPU resources for background work, while in cases with a larger number of processors, where the background work is typically less intrusive to foreground work, it yields less frequently.
- At the same time, progress should be made on background work such that steady-state perf is reached in reasonable time; yielding too frequently can slow down the background work too much. The sleep duration is measured to identify oversubscribed situations, so that the thread can yield less frequently and make faster progress on the background work.
- Due to less time spent rejitting in some CPU-limited cases, steady-state performance may be reached a bit later, in favor of fewer spikes along the way
- When the portable thread pool is enabled, a side effect of using a managed worker thread for tiering background work was that several GC-heavy microbenchmarks regressed. Tiering was the only thing using the thread pool in those tests, and stack-walking the managed thread was slower due to the presence of GC refs. It's not too concerning; the benchmarks are just measuring something different from before, but in any case this change also resolves that issue. Fixes #44211.
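
For illustration, here is a minimal sketch of the yield policy described above. It is not the runtime's actual implementation; the names, the doubling heuristic, and the sleep duration are made up for the example.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Illustrative sketch only: work for a budget between MinWorkMs and MaxWorkMs before
// yielding with a plain Sleep, and measure the sleep to detect CPU oversubscription so
// that background work still makes progress when sleeps run long.
static class TieringWorkerSketch
{
    private const int MaxWorkMs = 50;                            // same max as before
    private static readonly int MinWorkMs =
        Math.Min(Environment.ProcessorCount, MaxWorkMs);         // min = processor count ms, capped to the max

    // doSomeWork performs one small unit of background work and returns true while more remains.
    public static void DoBackgroundWork(Func<bool> doSomeWork)
    {
        const int SleepMs = 10;                                  // arbitrary yield duration for the sketch
        int workBudgetMs = MinWorkMs;
        var sw = new Stopwatch();

        bool moreWork = true;
        while (moreWork)
        {
            // Work until the current budget is used up (or the work runs out).
            sw.Restart();
            while (moreWork && sw.ElapsedMilliseconds < workBudgetMs)
                moreWork = doSomeWork();
            if (!moreWork)
                break;

            // Yield the thread and measure how long the sleep actually took.
            sw.Restart();
            Thread.Sleep(SleepMs);
            bool oversubscribed = sw.ElapsedMilliseconds > 2 * SleepMs;

            // If sleeps run long, the CPU is oversubscribed: work longer per stretch
            // (up to the max) so the background work still finishes in reasonable time;
            // otherwise fall back toward the minimum budget.
            workBudgetMs = oversubscribed
                ? Math.Min(workBudgetMs * 2, MaxWorkMs)
                : MinWorkMs;
        }
    }
}
```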
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jan 26, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Feb 25, 2021
@kouvel kouvel moved this from In progress + Investigations to Complete in Core-Runtime .net 9 Feb 8, 2022