Hysteresis effect on threadpool hill-climbing #51935

Closed
sebastienros opened this issue Apr 27, 2021 · 10 comments · Fixed by #52397

Comments

@sebastienros
Member

We have noticed a periodic pattern in the threadpool hill-climbing logic: the thread count settles at either n-cores or n-cores + 20, with a hysteresis effect that switches every 3-4 weeks:

[chart: ThreadPool thread count over time, alternating between n-cores and n-cores + 20]

The main visible impact is on performance results; here is an example with JsonPlatform mean latency, but some scenarios are also affected in throughput:

[chart: JsonPlatform mean latency over time, showing the same periodic switching]

This happens independently of the runtime version, meaning that using an older runtime/aspnet/sdk doesn't change the "current" value of the TP threads.

It is also independent of the hardware, and happens on all machines (Linux only) on the same day. These machines have auto-updates disabled. Here are ARM64 (32 cores), AMD (48 cores), INTEL (28 cores):

[chart: ThreadPool thread count on the ARM64 (32 cores), AMD (48 cores), and INTEL (28 cores) machines, all switching on the same day]

Disabling hill-climbing restores the better perf in this case, so it is believed that fixing this variation will actually have a negative impact on perf for these scenarios.

@sebastienros added the tenet-performance (Performance related issue) label Apr 27, 2021
@dotnet-issue-labeler bot added the untriaged (New issue has not been triaged by the area owner) label Apr 27, 2021
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@mangod9 removed the untriaged (New issue has not been triaged by the area owner) label May 4, 2021
@mangod9 added this to the 6.0.0 milestone May 4, 2021
@KevinCathcart
Contributor

KevinCathcart commented May 4, 2021

The whole period of the square wave sounds an awful lot like it is around 49.7 days, which is how long it takes GetTickCount() to wrap around. On POSIX platforms the platform abstraction layer implements this function, and the value it returns is based not on uptime but on wall-clock time, which matches all machines changing on the same day.

In managed code GetTickCount is exposed as Environment.TickCount, a signed 32-bit integer. The pasted chart's Y axis provides little precision, but it looks like the thread count is capped at the CPU count when Environment.TickCount is negative, and the +20 happens when it is positive.

If I am right, the changeover dates would be the following (give or take a day or so due to timezones, what time of day the data was taken, etc.; a rough sketch of the arithmetic follows the list):
Thu Jan 14 2021
Sun Feb 07 2021
Thu Mar 04 2021
Mon Mar 29 2021
Fri Apr 23 2021
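
As a rough sketch of that arithmetic (assuming the first date above as the anchor flip): GetTickCount() wraps every 2^32 ms, about 49.7 days, so the sign of the 32-bit value flips every 2^31 ms, about 24.86 days.

// Rough sketch: project sign-flip dates of a signed 32-bit tick counter from one
// observed flip. The anchor date is an assumption taken from the list above; actual
// dates drift by timezone and by what time of day the data was sampled.
using System;

class TickCountFlipDates
{
    static void Main()
    {
        const double halfPeriodMs = 2147483648.0;     // 2^31 ms ~= 24.86 days per sign flip
        DateTime anchor = new DateTime(2021, 1, 14);  // assumed first flip
        for (int i = 0; i < 5; i++)
        {
            Console.WriteLine(anchor.AddMilliseconds(halfPeriodMs * i).ToString("ddd MMM dd yyyy"));
        }
    }
}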

What is unclear to me is where the bug is. All the TickCount usages in the portable thread pool code appear at first glance to be doing the right thing. But if these dates match up, then this sort of thing is pretty clearly the cause. It could potentially be a different API than Environment.TickCount that is also derived from GetTickCount().

Hope this helps.

@sebastienros
Member Author

@KevinCathcart I can confirm the dates match the ones you provided.

@KevinCathcart
Contributor

KevinCathcart commented May 5, 2021

Ok, found a bug that could definitely be causing this:

private bool ShouldAdjustMaxWorkersActive(int currentTimeMs)
{
    // We need to subtract by prior time because Environment.TickCount can wrap around, making a comparison of absolute times unreliable.
    int priorTime = Volatile.Read(ref _separated.priorCompletedWorkRequestsTime);
    int requiredInterval = _separated.nextCompletedWorkRequestsTime - priorTime;
    int elapsedInterval = currentTimeMs - priorTime;
    if (elapsedInterval >= requiredInterval)
    {
        // ...

currentTimeMs is Environment.TickCount, which in this case happens to be negative.

The if clause controls whether the hill climbing is even run.

_separated.priorCompletedWorkRequestsTime and _separated.nextCompletedWorkRequestsTime start out as zero on process start, and only get updated if the hill climbing code is run.

Therefore, requiredInterval = 0 - 0 and elapsedInterval = negativeNumber - 0. This turns the if statement into
if (negativeNumber - 0 >= 0 - 0), which returns false, so the hill climbing code is not run; therefore the variables never get updated and remain zero. The native version of the thread pool code does all of this math with unsigned numbers, which avoids such a bug, and its equivalent part is not even quite the same math in the first place.

The easy fix here is probably to use unsigned arithmetic, but alternatively, initializing the two fields to Environment.TickCount would probably also work.
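
As a standalone illustration (a sketch of the unsigned approach, not the runtime code or the eventual PR), here is the difference between the signed and unsigned comparison when the stored times are still zero and the current TickCount is negative:

// Sketch: signed vs. wrap-safe unsigned interval comparison, with zero-initialized
// prior/next times and a negative TickCount-style current time.
using System;

class IntervalComparisonDemo
{
    // Mirrors the buggy check: a negative current time makes elapsedInterval negative.
    static bool ShouldAdjustSigned(int currentTimeMs, int priorTime, int nextTime) =>
        currentTimeMs - priorTime >= nextTime - priorTime;

    // Unsigned arithmetic keeps the subtraction wrap-safe, so the check still passes.
    static bool ShouldAdjustUnsigned(int currentTimeMs, int priorTime, int nextTime) =>
        (uint)(currentTimeMs - priorTime) >= (uint)(nextTime - priorTime);

    static void Main()
    {
        int negativeTick = int.MinValue + 5000;  // TickCount in the "negative" half of the wrap
        Console.WriteLine(ShouldAdjustSigned(negativeTick, 0, 0));   // False: hill climbing never runs
        Console.WriteLine(ShouldAdjustUnsigned(negativeTick, 0, 0)); // True: check still fires
    }
}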

@danmoseley
Member

@jkoritzinsky

@jkoritzinsky
Member

cc: @kouvel

@kouvel
Member

kouvel commented May 5, 2021

Nice find, thanks! I'll put up a PR to fix it.

@sebastienros
Member Author

@KevinCathcart you passed the interview, start on Monday?

kouvel added a commit to kouvel/runtime that referenced this issue May 6, 2021
Updated the time intervals used in the hill climbing check to be unsigned to prevent wrap-around. This also prevents the check from failing when the current time is negative since the prior and next times are initialized to zero, and matches the previous implementation.

Fixes dotnet#51935
@ghost added the in-pr (There is an active PR which will close this issue when it is merged) label May 6, 2021
kouvel added a commit that referenced this issue May 6, 2021
Updated the time intervals used in the hill climbing check to be unsigned to prevent wrap-around. This also prevents the check from failing when the current time is negative since the prior and next times are initialized to zero, and matches the previous implementation.

Fixes #51935
@ghost removed the in-pr (There is an active PR which will close this issue when it is merged) label May 6, 2021
@GSPP

GSPP commented May 8, 2021

In my estimation, essentially all uses of Environment.TickCount in the BCL are bugs and should be changed. Using TickCount makes the framework brittle for use in very long-running processes.

> On POSIX platforms the platform abstraction layer implements this function, and the value it returns is based not on uptime but on wall-clock time, which matches all machines changing on the same day.

Is it true that in a freshly started .NET process, Environment.TickCount can be negative? That sounds like a significant breaking change. It also seems undesirable regardless of whether it is breaking.
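
For reference, a small sketch (not something prescribed in this thread): Environment.TickCount64 is a 64-bit counter, available since .NET Core 3.0, that does not wrap in any realistic process lifetime, while the 32-bit TickCount can indeed be negative:

// Sketch: Environment.TickCount is a signed 32-bit value that wraps every ~49.7 days
// and can be negative; Environment.TickCount64 (available since .NET Core 3.0) is a
// 64-bit value that keeps increasing for the life of the process.
using System;

class TickCountComparison
{
    static void Main()
    {
        Console.WriteLine($"TickCount   : {Environment.TickCount}");   // may be negative
        Console.WriteLine($"TickCount64 : {Environment.TickCount64}"); // long, monotonically increasing
    }
}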

@jkotas
Member

jkotas commented May 8, 2021

> Is it true that in a freshly started .NET process, Environment.TickCount can be negative?

Yes. It has been like that since .NET Framework 1.0, and the documentation explicitly mentions it. Nothing really broke here. I agree that it is a trap one has to be careful about.

@ghost ghost locked as resolved and limited conversation to collaborators Jun 7, 2021