Hysteresis effect on threadpool hill-climbing #51935

Closed
sebastienros opened this issue Apr 27, 2021 · 10 comments · Fixed by #52397

Comments

@sebastienros
Member

We have noticed a periodic pattern in the threadpool hill-climbing logic: the thread count settles at either n-cores or n-cores + 20, with a hysteresis effect that switches every 3-4 weeks:

[chart: ThreadPool thread count over time, alternating between n-cores and n-cores + 20]

The main visible impact is on performance results; here is an example with JsonPlatform mean latency, but some scenarios are also affected in throughput:

[chart: JsonPlatform mean latency over time, showing the same periodic switching]

This happens independently of the runtime version, meaning that using an older runtime/aspnet/sdk doesn't change the "current" value of the TP threads.

It is also independent of the hardware, and happens on all machines (Linux only) on the same day. These machines have auto-updates disabled. Here are ARM64 (32 cores), AMD (48 cores), INTEL (28 cores):

[chart: ThreadPool thread count on the ARM64 (32 cores), AMD (48 cores), and INTEL (28 cores) machines, all switching on the same day]

Disabling hill-climbing restores the better perf in this case, so it is believed that fixing this variation will actually have a negative impact on perf for these scenarios.

@sebastienros added the tenet-performance (Performance related issue) label Apr 27, 2021
@dotnet-issue-labeler bot added the untriaged (New issue has not been triaged by the area owner) label Apr 27, 2021
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@mangod9 removed the untriaged (New issue has not been triaged by the area owner) label May 4, 2021
@mangod9 added this to the 6.0.0 milestone May 4, 2021
@KevinCathcart
Contributor

KevinCathcart commented May 4, 2021

The whole period of the square wave sounds an awful lot like it is around 49.7 days, which is how long it takes GetTickCount() to wrap around. On POSIX platforms the platform abstraction layer implements this function, and the value it returns is based not on uptime but on wall-clock time, which matches all machines changing on the same day.

In managed code GetTickCount is exposed as Environment.TickCount, a signed 32-bit integer. The pasted chart's Y axis provides little precision, but it looks like the thread count is capped at the CPU count when Environment.TickCount is negative, and the +20 happens when it is positive.

If I am right, the changeover dates would be the following (give or take a day or so due to timezones, what time of day the data was taken, etc.; a rough sketch of the arithmetic follows the list):
Thu Jan 14 2021
Sun Feb 07 2021
Thu Mar 04 2021
Mon Mar 29 2021
Fri Apr 23 2021
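
As a rough sketch of that arithmetic (assuming the first date above as the anchor flip): GetTickCount() wraps every 2^32 ms, about 49.7 days, so the sign of the 32-bit value flips every 2^31 ms, about 24.86 days.

// Rough sketch: project sign-flip dates of a signed 32-bit tick counter from one
// observed flip. The anchor date is an assumption taken from the list above; actual
// dates drift by timezone and by what time of day the data was sampled.
using System;

class TickCountFlipDates
{
    static void Main()
    {
        const double halfPeriodMs = 2147483648.0;     // 2^31 ms ~= 24.86 days per sign flip
        DateTime anchor = new DateTime(2021, 1, 14);  // assumed first flip
        for (int i = 0; i < 5; i++)
        {
            Console.WriteLine(anchor.AddMilliseconds(halfPeriodMs * i).ToString("ddd MMM dd yyyy"));
        }
    }
}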

What is unclear to me is where the bug is. All the TickCount usages in the portable thread pool code appear at first glance to be doing the right thing. But if these dates match up, then this sort of thing is pretty clearly the cause. It could potentially be a different API than Environment.TickCount that is also derived from GetTickCount().

Hope this helps.

@sebastienros
Member Author

@KevinCathcart I can confirm the dates match the ones you provided.

@KevinCathcart
Contributor

KevinCathcart commented May 5, 2021

Ok, found a bug that could definitely be causing this:

private bool ShouldAdjustMaxWorkersActive(int currentTimeMs)
{
    // We need to subtract by prior time because Environment.TickCount can wrap around, making a comparison of absolute times unreliable.
    int priorTime = Volatile.Read(ref _separated.priorCompletedWorkRequestsTime);
    int requiredInterval = _separated.nextCompletedWorkRequestsTime - priorTime;
    int elapsedInterval = currentTimeMs - priorTime;
    if (elapsedInterval >= requiredInterval)
    {
        // ...

currentTimeMs is Environment.TickCount, which in this case happens to be negative.

The if clause controls whether the hill climbing is even run.

_separated.priorCompletedWorkRequestsTime and _separated.nextCompletedWorkRequestsTime start out as zero on process start, and only get updated if the hill climbing code is run.

Therefore, requiredInterval = 0 - 0 and elapsedInterval = negativeNumber - 0. This turns the if statement into
if (negativeNumber - 0 >= 0 - 0), which returns false, so the hill climbing code is not run; therefore the variables never get updated and remain zero. The native version of the thread pool code does all of this math with unsigned numbers, which avoids such a bug, and its equivalent part is not even quite the same math in the first place.

The easy fix here is probably to use unsigned arithmetic, but alternatively, initializing the two fields to Environment.TickCount would probably also work.
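
As a standalone illustration (a sketch of the unsigned approach, not the runtime code or the eventual PR), here is the difference between the signed and unsigned comparison when the stored times are still zero and the current TickCount is negative:

// Sketch: signed vs. wrap-safe unsigned interval comparison, with zero-initialized
// prior/next times and a negative TickCount-style current time.
using System;

class IntervalComparisonDemo
{
    // Mirrors the buggy check: a negative current time makes elapsedInterval negative.
    static bool ShouldAdjustSigned(int currentTimeMs, int priorTime, int nextTime) =>
        currentTimeMs - priorTime >= nextTime - priorTime;

    // Unsigned arithmetic keeps the subtraction wrap-safe, so the check still passes.
    static bool ShouldAdjustUnsigned(int currentTimeMs, int priorTime, int nextTime) =>
        (uint)(currentTimeMs - priorTime) >= (uint)(nextTime - priorTime);

    static void Main()
    {
        int negativeTick = int.MinValue + 5000;  // TickCount in the "negative" half of the wrap
        Console.WriteLine(ShouldAdjustSigned(negativeTick, 0, 0));   // False: hill climbing never runs
        Console.WriteLine(ShouldAdjustUnsigned(negativeTick, 0, 0)); // True: check still fires
    }
}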

@danmoseley
Member

@jkoritzinsky

@jkoritzinsky
Member

cc: @kouvel

@kouvel
Member

kouvel commented May 5, 2021

Nice find, thanks! I'll put up a PR to fix it.

@sebastienros
Member Author

@KevinCathcart you passed the interview, start on Monday?

kouvel added a commit to kouvel/runtime that referenced this issue May 6, 2021
Updated the time intervals used in the hill climbing check to be unsigned to prevent wrap-around. This also prevents the check from failing when the current time is negative since the prior and next times are initialized to zero, and matches the previous implementation.

Fixes dotnet#51935
@ghost added the in-pr (There is an active PR which will close this issue when it is merged) label May 6, 2021
kouvel added a commit that referenced this issue May 6, 2021
Updated the time intervals used in the hill climbing check to be unsigned to prevent wrap-around. This also prevents the check from failing when the current time is negative since the prior and next times are initialized to zero, and matches the previous implementation.

Fixes #51935
@ghost removed the in-pr (There is an active PR which will close this issue when it is merged) label May 6, 2021
@GSPP

GSPP commented May 8, 2021

In my estimation, essentially all uses of Environment.TickCount in the BCL are bugs and should be changed. Using TickCount makes the framework brittle for use in very long-running processes.

> On POSIX platforms the platform abstraction layer implements this function, and the value it returns is based not on uptime but on wall-clock time, which matches all machines changing on the same day.

Is it true that in a freshly started .NET process, Environment.TickCount can be negative? That sounds like a significant breaking change. It also seems undesirable regardless of whether it is breaking.
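
For reference, a small sketch (not something prescribed in this thread): Environment.TickCount64 is a 64-bit counter, available since .NET Core 3.0, that does not wrap in any realistic process lifetime, while the 32-bit TickCount can indeed be negative:

// Sketch: Environment.TickCount is a signed 32-bit value that wraps every ~49.7 days
// and can be negative; Environment.TickCount64 (available since .NET Core 3.0) is a
// 64-bit value that keeps increasing for the life of the process.
using System;

class TickCountComparison
{
    static void Main()
    {
        Console.WriteLine($"TickCount   : {Environment.TickCount}");   // may be negative
        Console.WriteLine($"TickCount64 : {Environment.TickCount64}"); // long, monotonically increasing
    }
}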

@jkotas
Member

jkotas commented May 8, 2021

> Is it true that in a freshly started .NET process, Environment.TickCount can be negative?

Yes. It has been like that since .NET Framework 1.0, and the documentation explicitly mentions it. Nothing really broke here. I agree that it is a trap one has to be careful about.

@ghost ghost locked as resolved and limited conversation to collaborators Jun 7, 2021