This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Improve support for --cpus to Docker CLI #23398

Closed
wants to merge 1 commit into from

Conversation


@luhenry luhenry commented Mar 21, 2019

This focuses on better supporting the Docker CLI parameter --cpus, which limits the amount of CPU time available to the container (e.g. 1.8 means 180% of CPU time: on 2 cores, 90% of each core; on 4 cores, 45% of each core; etc.)

All the runtime components depending on the number of processors available are:

  • ThreadPool
  • GC
  • Environment.ProcessorCount via SystemNative::GetProcessorCount
  • SimpleRWLock::m_spinCount
  • BaseDomain::m_iNumberOfProcessors (it's used to determine the GC heap to affinitize to)

All the above components take advantage of --cpus via CGroup::GetCpuLimit since #12797, which optimizes performance in a container/machine with limited resources. This makes sure the runtime components make the best use of the available resources.

In the case of Environment.ProcessorCount, the behavior is such that passing --cpus=1.5 on a machine with 8 processors will return 1, as shown in #22302 (comment). This behavior is not consistent with Windows Job Objects, which still report the number of processors of the container/machine even if it only gets part of the total number of cycles.

This behavior is erroneous because the container still has access to the full range of processors on the machine, and only its processor time is limited. For example, on a 4-processor machine with --cpus=1.8, there can be 4 threads running in parallel even though each thread will only get 1.8 / 4 = 0.45, or 45%, of each processor's cycles.

The work consists of reverting SystemNative::GetProcessorCount to its pre-#12797 behavior.

@luhenry luhenry requested a review from janvorli March 21, 2019 22:09
@luhenry
Author

luhenry commented Mar 21, 2019

/cc @sergiy-k

Member

@janvorli janvorli left a comment


LGTM modulo the nit.

src/pal/src/misc/sysinfo.cpp (review thread resolved)
@janvorli
Member

Actually, there is one more place where we use PAL_GetCpuLimit: GetCurrentProcessCpuCount in util.cpp. Can you please remove the call to it from there as well, and then remove the function itself (its definition, its declaration in pal.h, and also its entry in mscordac_unixexports.src)?

@stephentoub
Member

stephentoub commented Mar 22, 2019

it also specifies which specific processor we have access to, but that’s irrelevant here

Not relevant to this PR, but should we be surfacing that as part of System.Diagnostics.Process.ProcessorAffinity? Right now we just use sched_getaffinity:
https://github.com/dotnet/corefx/blob/00c395eeff363e0ed890622e068817c8d463a72b/src/System.Diagnostics.Process/src/System/Diagnostics/Process.Linux.cs#L181
https://github.com/dotnet/corefx/blob/b10e8d67b260e26f2e47750cf96669e6f48e774d/src/Native/Unix/System.Native/pal_process.c#L785
I see you're adding relevant code to this PR, so maybe we're already correctly reflecting this in ProcessorAffinity because sched_getaffinity reflects it?

@janvorli
Member

janvorli commented Mar 22, 2019

Hmm, I didn't know about System.Diagnostics.Process.ProcessorAffinity before. It is unfortunate that it represents the Windows way of expressing affinity, where a thread can be affinitized to at most 64 processors from the same processor group. On Unix there is no such limit; a thread can be affinitized to all processors on the system. So the Unix affinity cannot be represented correctly by a 64-bit mask on systems with more than 64 CPUs.

@stephentoub
Member

Yup. C'est la vie.

@luhenry
Author

luhenry commented Mar 22, 2019

@stephentoub we use sched_getaffinity to count the number of available processors at https://github.com/dotnet/coreclr/pull/23398/files#diff-4be22576babe49f6d25015c61d3f3ad2R109, so using it in Process.ProcessorAffinity is already the right thing to do.

@luhenry
Author

luhenry commented Mar 22, 2019

@janvorli I wouldn't want to remove the use of PAL_GetCpuLimit in GetCurrentProcessCpuCount unless we also surface it to whichever callers of GetCurrentProcessCpuCount need it.

Let's set up a few examples where --cpus has an effect. Say we have a 4-processor machine, with --cpus=1.8 passed to the Docker container.

  1. If you have an application with 1 thread that never spawns any other thread, then the main and only thread will use 100% of 1 processor, and will only use 1 / 1.8 of its budget.
  2. If you have an application with 2 threads, the 1st thread being very CPU intensive (using 100% of 1 processor) and the 2nd thread being CPU intensive in intervals (using 100% of 1 processor for 1 second every 5 seconds), then when only thread 1 is running it will have 100% of 1 processor, and when threads 1 and 2 are both running, together they will have 180% spread across 2 processors (we can assume the scheduler will distribute 90% of each processor to each thread).
  3. If you have 4 threads, all very CPU intensive, then you have a budget of 180% spread across 4 processors (we can again assume the scheduler will distribute 45% of each processor to each thread).

Using PAL_GetCpuLimit allows minimizing context-switch costs by artificially limiting the runtime to a limited number of threads, and thus spending the maximum amount of the --cpus budget on a minimal number of threads. That brings us closer to example 1, where 1 thread uses 100% of 1 processor, versus example 3, where 4 threads each use 45% of 4 processors.

Per https://github.com/dotnet/coreclr/issues/22302#issuecomment-475115658, I can understand why we would want to keep the existing behavior at https://github.com/dotnet/coreclr/pull/23398/files#diff-0d2e4fca23d725bbd5e637d1fe57ed0fL346 to allow for the same kind of optimizations without relying on a new Environment.ProcessorQuota. My opinion is that we should return the correct value from Environment.ProcessorCount (i.e. 4 in the above example, even with a 1.8 CPU budget), but I'll let you be the judge of that.

@janvorli
Member

@luhenry it seems confusing to expose one processor count to the managed runtime / user code and a different one to the GC / threadpool. The managed user / runtime code can rely on the number of processors reported by the runtime for similar or the same purposes as the GC or threadpool. So I would tend to think that we should expose the same processor count to both.

Based on the recent discussion, I had switched my view of the limit applied by --cpus to seeing it as if the CPUs were slowed down by a factor of the value of that argument divided by the real number of cores. But from your points above, it doesn't seem like a fully correct analogy either.
I think I need to think about it more w.r.t. whether the count we expose should or should not be limited by the --cpus option.

@tmds
Member

tmds commented Mar 22, 2019

imo, we should look at this from the Kubernetes perspective, and not from the Docker cli perspective.

This is Kubernetes documentation for managing compute resources: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/

@tmds
Member

tmds commented Mar 22, 2019

I have switched my view of the limit applied by --cpus to see it as if the CPUs were slowed down by a factor of the value of that argument divided by the real number of cores

If I assign a container 0.1 CPU on a 16-core machine, I'd rather have it run at 10% on 1 core than at 0.625% on 16. The kernel scheduler should be 'smart' about this.

@luhenry
Author

luhenry commented Mar 22, 2019

I'll split off the non-controversial part of this PR [1] into another PR, just to get it in while we figure out this discussion here.

[1] https://github.com/dotnet/coreclr/pull/23398/files#diff-4be22576babe49f6d25015c61d3f3ad2R105

@luhenry luhenry changed the title Improve detection of CPU limits when running inside a Container Fix Environment.ProcessorCount when passing --cpus to Docker CLI Mar 22, 2019
@luhenry luhenry force-pushed the fix-gh22302 branch 5 times, most recently from 58443dc to 9d70701 Compare March 27, 2019 20:25
@jkotas jkotas requested a review from Maoni0 March 27, 2019 22:04
src/gc/env/gcenv.os.h (review thread resolved)
src/gc/gc.cpp (review thread resolved)
@luhenry
Author

luhenry commented Mar 27, 2019

@jkotas no particularly good reason for 2 different names. It seems to me Budget is a bit more explicit, but I'll change it to Limit for consistency's sake.

src/gc/gc.cpp (review thread resolved)
@tmds
Member

tmds commented Mar 28, 2019

There is a lot of change in this PR. Instead, couldn't we change these lines to round up?

    // Cannot have less than 1 CPU
    if (quota <= period)
    {
        *val = 1;
        return true;
    }

    cpu_count = quota / period;

@luhenry
Author

luhenry commented Mar 29, 2019

The test failures look unrelated and have failed 68 times over the last 14 days on master.

@janvorli @Maoni0 @jkotas @tmds did I miss any of your reviews/comments?

src/utilcode/util.cpp (review thread resolved)
src/vm/win32threadpool.cpp (review thread resolved)
src/gc/gc.cpp (review thread resolved)
src/utilcode/util.cpp (review thread resolved)
@luhenry luhenry changed the title Fix Environment.ProcessorCount when passing --cpus to Docker CLI Improve support for --cpus to Docker CLI Mar 29, 2019
@luhenry
Author

luhenry commented Mar 29, 2019

To verify the beneficial effect of this change I ran the ASP.NET Core benchmarks with the following configurations and results:

1. with `--cpuset-cpus=4-15 --cpus=8`, without this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 1,477,768 | 101 | 9.79 | 195 |
   | 1,514,134 | 102 | 9.34 | 195 |
   | 1,519,981 | 101 | 9.67 | 194 |
   | 1,500,781 | 100 | 9.55 | 194 |
   | 1,500,529 | 101 | 9.74 | 196 |

2. with `--cpuset-cpus=4-15 --cpus=8`, with this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 1,534,869 | 101 | 8.31 | 196 |
   | 1,506,254 | 101 | 8.67 | 196 |
   | 1,525,930 | 100 | 8.68 | 197 |
   | 1,506,759 | 103 | 9.12 | 201 |
   | 1,541,576 | 102 | 8.94 | 197 |

We can observe a clear gain in RPS and a reduction in average latency.

-        cpu_count = quota / period;
+        cpu_count = (double) quota / period;
Member

        // Cannot have less than 1 CPU
        if (quota <= period)
        {
            *val = 1;
            return true;
        }

        cpu_count = (double) quota / period;
        if (cpu_count < UINT32_MAX)
        {
            // round up
            *val = (uint32_t)(cpu_count + 0.999999999);
        }
        else
        {
            *val = UINT32_MAX;
        }

maybe change this to:

*val = (UINT)(min((quota + period - 1)/period, UINT32_MAX))

if (PAL_GetCpuLimit(&cpuLimit) && cpuLimit < (uint32_t)processorCount)
    processorCount = cpuLimit;
#endif

END_QCALL;
Member

Do we want to remove this? Doesn't that change Environment.ProcessorCount to no longer take into account --cpus?

Author

Yes, that's the goal of the change.

src/vm/gcenv.os.cpp (review thread resolved)
@stephentoub
Member

I haven't followed the full discussion, but one thing to confirm:
Will the default min thread pool count still be >= Environment.ProcessorCount? For better or worse, there's a fair amount of code that depends on that.

@luhenry
Author

luhenry commented Apr 1, 2019

@stephentoub yes, and that's after discussion with @VSadov at https://github.com/dotnet/coreclr/issues/22302#issuecomment-477386311 and performance measurements that confirm it is the better approach.

@tmds we verified that the results are actually better with these changes (including the change to Environment.ProcessorCount). So is there any other reason not to get it in?

@stephentoub
Member

yes

Ok, good, thanks. It ends up being impactful in situations where a system queues Environment.ProcessorCount work items and expects that they will all be able to run concurrently (PLINQ does this for some queries, for example, where the work items it queues all need to join with each other at a barrier). While other work being executed can impact this, at least on an unloaded system if the min thread pool count >= ProcessorCount, the work items should all be able to start and run concurrently quickly. If the min thread pool count < ProcessorCount, then we immediately hit a starvation situation and it could take quite some time for the pool to ramp up to the required number of threads.

@luhenry
Author

luhenry commented Apr 1, 2019

The Environment.ProcessorCount work items queued to the ThreadPool will run concurrently, even though they will not be able to use 100% of each CPU they are running on; that's what --cpus does. The ThreadPool will then be able to take this --cpus value into account with https://github.com/dotnet/coreclr/issues/22302#issuecomment-477386311

@stephentoub
Member

even though they will not be able to use 100% of each CPU they are running on

That's fine.

@tmds
Member

tmds commented Apr 1, 2019

@tmds we verified that the resuts are actually better with these changes (including the change to Environment.ProcessorCount). So is there any other reason not to get it in?

It is not uncommon to run a lot of containers with a CPU quota of less than 1 on beefy servers with many cores. The benchmark results posted here don't include that scenario. I think we may be reducing performance in that case.

@luhenry
Author

luhenry commented Apr 3, 2019

@tmds following are results from running the ASP.NET Core benchmarks:

1. with `--cpuset-cpus=4-15 --cpus=0.5`, without this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 158940 | 53 | 31.77 | 505 |
   | 156841 | 52 | 31.18 | 720 |
   | 159030 | 52 | 31.19 | 509 |
   | 155849 | 52 | 31.55 | 507 |
   | 160123 | 52 | 31.17 | 408 |

2. with `--cpuset-cpus=4-15 --cpus=0.5`, with this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 154878 | 54 | 32.09 | 430 |
   | 163776 | 51 | 32.09 | 801 |
   | 162235 | 50 | 32.08 | 507 |
   | 148181 | 51 | 32.89 | 440 |
   | 157830 | 51 | 30.26 | 487 |

3. with `--cpuset-cpus=4-15 --cpus=1`, without this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 321827 | 106 | 15.13 | 279 |
   | 330651 | 104 | 11.13 | 200 |
   | 321547 | 104 | 14.95 | 321 |
   | 334414 | 104 | 10.56 | 262 |
   | 330111 | 104 | 13.32 | 241 |

4. with `--cpuset-cpus=4-15 --cpus=1`, with this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 320951 | 101 | 13.8 | 339 |
   | 323964 | 105 | 12.51 | 288 |
   | 312130 | 101 | 14.19 | 308 |
   | 323971 | 104 | 12.61 | 199 |
   | 330520 | 102 | 13.97 | 242 |

5. with `--cpuset-cpus=4-15 --cpus=2`, without this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 647812 | 104 | 9.91 | 242 |
   | 641200 | 102 | 8.36 | 200 |
   | 629749 | 104 | 11.54 | 234 |
   | 631857 | 101 | 9.09 | 200 |
   | 627933 | 101 | 11.19 | 294 |

6. with `--cpuset-cpus=4-15 --cpus=2`, with this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 646432 | 101 | 8.98 | 196 |
   | 637302 | 105 | 9.15 | 198 |
   | 654804 | 101 | 9.82 | 200 |
   | 635966 | 106 | 10.01 | 203 |
   | 634108 | 103 | 9.02 | 264 |

7. with `--cpuset-cpus=4-15 --cpus=4`, without this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 1157659 | 101 | 4.75 | 246 |
   | 1169138 | 103 | 4.8 | 196 |
   | 1184439 | 102 | 4.72 | 199 |
   | 1154286 | 101 | 5.65 | 211 |
   | 1128476 | 103 | 6.57 | 203 |

8. with `--cpuset-cpus=4-15 --cpus=4`, with this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 1202938 | 105 | 4.25 | 199 |
   | 1139925 | 101 | 5.09 | 228 |
   | 1155460 | 101 | 4.84 | 200 |
   | 1125296 | 102 | 5.04 | 200 |
   | 1165013 | 102 | 5.27 | 215 |

We can see there is no decrease in performance with the changes, and even an increase in performance with https://github.com/dotnet/coreclr/issues/22302#issuecomment-477330281 and #23398 (comment).

@tmds
Member

tmds commented Apr 3, 2019

@luhenry thanks for running those benchmarks. Unfortunately, I don't feel confident about this change yet. It's not clear to me what now takes --cpus into account, what does not, and why. Also, I wonder if the benchmark results are a good match for the changes made here. For example, ProcessorCount changed from 1 to 12 in the last benchmark and the results barely change. Another thing is that the GC is also affected, but these ASP.NET benchmarks have been heavily optimized to cause very few GCs.

@@ -148,7 +148,8 @@ class SimpleRWLock
} CONTRACTL_END;

m_RWLock = 0;
m_spinCount = (GetCurrentProcessCpuCount() == 1) ? 0 : 4000;
// Passing false here reduces ASP.NET Core Plaintext benchmark results from 1.2M to 0.8M RPS.
Member

@jkotas jkotas Apr 4, 2019
Just curious - where (stacktrace) does the ASP.NET Core Plaintext take this lock?
