This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Improve support for --cpus to Docker CLI #23398

Closed
wants to merge 1 commit into from

Conversation


@luhenry luhenry commented Mar 21, 2019

This focuses on better supporting the Docker CLI parameter --cpus, which limits the amount of CPU time available to the container (e.g. 1.8 means 180% of CPU time: on 2 cores, 90% of each core; on 4 cores, 45% of each core; etc.)

All the runtime components depending on the number of processors available are:

  • ThreadPool
  • GC
  • Environment.ProcessorCount via SystemNative::GetProcessorCount
  • SimpleRWLock::m_spinCount
  • BaseDomain::m_iNumberOfProcessors (it's used to determine the GC heap to affinitize to)

All the above components take advantage of --cpus via CGroup::GetCpuLimit since #12797, which optimizes performance in a container/machine with limited resources. This makes sure the runtime components make the best use of the available resources.

In the case of Environment.ProcessorCount, the behavior is such that passing --cpus=1.5 on a machine with 8 processors will return 1, as shown in #22302 (comment). This behavior is not consistent with Windows Job Objects, which still report the number of processors of the container/machine even if it only gets part of the total number of cycles.

This behavior is erroneous because the container still has access to the full range of processors on the machine, and only its processor time is limited. For example, on a 4-processor machine with --cpus=1.8, there can be 4 threads running in parallel even though each thread will only get 1.8 / 4 = 0.45, or 45%, of each processor's cycles.

The work consists of reverting SystemNative::GetProcessorCount to its pre-#12797 behavior.

@luhenry luhenry requested a review from janvorli March 21, 2019 22:09
@luhenry
Author

luhenry commented Mar 21, 2019

/cc @sergiy-k

Member

@janvorli janvorli left a comment


LGTM modulo the nit.

src/pal/src/misc/sysinfo.cpp (review thread resolved)
@janvorli
Member

Actually, there is one more place where we use PAL_GetCpuLimit: GetCurrentProcessCpuCount in util.cpp. Can you please remove the call to it from there as well, and then remove the function itself (its definition, its declaration in pal.h, and also its entry in mscordac_unixexports.src)?

@stephentoub
Member

stephentoub commented Mar 22, 2019

it also specifies which specific processor we have access to, but that’s irrelevant here

Not relevant to this PR, but should we be surfacing that as part of System.Diagnostics.Process.ProcessorAffinity? Right now we just use sched_getaffinity:
https://github.com/dotnet/corefx/blob/00c395eeff363e0ed890622e068817c8d463a72b/src/System.Diagnostics.Process/src/System/Diagnostics/Process.Linux.cs#L181
https://github.com/dotnet/corefx/blob/b10e8d67b260e26f2e47750cf96669e6f48e774d/src/Native/Unix/System.Native/pal_process.c#L785
I see you're adding relevant code to this PR, so maybe we're already correctly reflecting this in ProcessorAffinity because sched_getaffinity reflects it?

@janvorli
Member

janvorli commented Mar 22, 2019

Hmm, I didn't know about System.Diagnostics.Process.ProcessorAffinity before. It is unfortunate that it represents the Windows way of expressing affinity, where a thread can be affinitized to at most 64 processors from the same processor group. On Unix there is no such limit; a thread can be affinitized to all processors on the system. So the Unix affinity cannot be represented correctly by a 64-bit mask on systems with more than 64 CPUs.

@stephentoub
Member

Yup. C'est la vie.

@luhenry
Author

luhenry commented Mar 22, 2019

@stephentoub we use sched_getaffinity to count the number of available processors at https://github.com/dotnet/coreclr/pull/23398/files#diff-4be22576babe49f6d25015c61d3f3ad2R109, so using it in Process.ProcessorAffinity is already the right thing to do.

@luhenry
Author

luhenry commented Mar 22, 2019

@janvorli I wouldn't want to remove the use of PAL_GetCpuLimit in GetCurrentProcessCpuCount unless we also surface it to whichever callers of GetCurrentProcessCpuCount need it.

Let's set up a few examples where --cpus has an effect. Say we have a 4-processor machine, with --cpus=1.8 passed to the Docker container.

  1. If you have an application with 1 thread that never spawns any other thread, then the main and only thread will use 100% of 1 processor, and will only use 1 / 1.8 of its budget.
  2. If you have an application with 2 threads, the 1st thread being very CPU intensive (using 100% of 1 processor) and the 2nd thread being CPU intensive in intervals (using 100% of 1 processor for 1 second every 5 seconds), then when only thread 1 is running it will have 100% of 1 processor, and when threads 1 and 2 are both running, together they will have 180% spread across 2 processors (we can assume the scheduler will distribute 90% of each processor to each thread).
  3. If you have 4 threads, all very CPU intensive, then you have a budget of 180% spread across 4 processors (we can again assume the scheduler will distribute 45% of each processor to each thread).

Using PAL_GetCpuLimit allows minimizing context-switch costs by artificially limiting the runtime to a limited number of threads, and thus spending the maximum amount of the --cpus budget on a minimal number of threads. That brings us closer to example 1, where 1 thread uses 100% of 1 processor, versus example 3, where 4 threads each use 45% of 4 processors.

Per https://github.com/dotnet/coreclr/issues/22302#issuecomment-475115658, I can understand why we would want to keep the existing behavior at https://github.com/dotnet/coreclr/pull/23398/files#diff-0d2e4fca23d725bbd5e637d1fe57ed0fL346 to allow for the same kind of optimizations without relying on a new Environment.ProcessorQuota. My opinion is that we should return the correct value from Environment.ProcessorCount (i.e. 4 in the above example, even with a 1.8 CPU budget), but I'll let you be the judge of that.

@janvorli
Member

@luhenry it seems confusing to expose one processor count to the managed runtime / user code and a different one to the GC / threadpool. The managed user / runtime code can rely on the number of processors reported by the runtime for similar or the same purposes as the GC or threadpool. So I would tend to think that we should expose the same processor count to both.

Based on the recent discussion, I had switched my view of the limit applied by --cpus to seeing it as if the CPUs were slowed down by a factor of the value of that argument divided by the real number of cores. But from your points above, it doesn't seem like a fully correct analogy either.
I think I need to think about it more w.r.t. whether the count we expose should or should not be limited by the --cpus option.

@tmds
Member

tmds commented Mar 22, 2019

imo, we should look at this from the Kubernetes perspective, and not from the Docker cli perspective.

This is Kubernetes documentation for managing compute resources: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/

@tmds
Member

tmds commented Mar 22, 2019

I have switched my view of the limit applied by --cpus to see it as if the CPUs were slowed down by a factor of the value of that argument divided by the real number of cores

If I assign a container 0.1 CPU on a 16-core machine, I'd rather have it run at 10% on 1 core than at 0.625% on 16. The kernel scheduler should be 'smart' about this.

@luhenry
Author

luhenry commented Mar 22, 2019

I'll split off the non-controversial part of this PR [1] into another PR, just to get it in while we figure out this discussion here.

[1] https://github.com/dotnet/coreclr/pull/23398/files#diff-4be22576babe49f6d25015c61d3f3ad2R105

@luhenry luhenry changed the title Improve detection of CPU limits when running inside a Container Fix Environment.ProcessorCount when passing --cpus to Docker CLI Mar 22, 2019
@luhenry luhenry force-pushed the fix-gh22302 branch 5 times, most recently from 58443dc to 9d70701 Compare March 27, 2019 20:25
@jkotas jkotas requested a review from Maoni0 March 27, 2019 22:04
src/gc/env/gcenv.os.h (review thread resolved)
src/gc/gc.cpp (review thread resolved)
@luhenry
Author

luhenry commented Mar 27, 2019

@jkotas no particularly good reason for 2 different names. It seems to me Budget is a bit more explicit, but I'll change it to Limit for consistency's sake.

src/gc/gc.cpp (review thread resolved)
@tmds
Member

tmds commented Mar 28, 2019

There is a lot of change in this PR. Instead, couldn't we change these lines to round up?

    // Cannot have less than 1 CPU
    if (quota <= period)
    {
        *val = 1;
        return true;
    }

    cpu_count = quota / period;

@luhenry
Author

luhenry commented Mar 29, 2019

The test failures look unrelated and have failed 68 times over the last 14 days on master.

@janvorli @Maoni0 @jkotas @tmds did I miss any of your reviews/comments?

src/utilcode/util.cpp (review thread resolved)
src/vm/win32threadpool.cpp (review thread resolved)
src/gc/gc.cpp (review thread resolved)
src/utilcode/util.cpp (review thread resolved)
@luhenry luhenry changed the title Fix Environment.ProcessorCount when passing --cpus to Docker CLI Improve support for --cpus to Docker CLI Mar 29, 2019
@luhenry
Author

luhenry commented Mar 29, 2019

To verify the beneficial effect of this change I ran the ASP.NET Core benchmarks with the following configurations and results:

1. with `--cpuset-cpus=4-15 --cpus=8`, without this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 1,477,768 | 101 | 9.79 | 195 |
   | 1,514,134 | 102 | 9.34 | 195 |
   | 1,519,981 | 101 | 9.67 | 194 |
   | 1,500,781 | 100 | 9.55 | 194 |
   | 1,500,529 | 101 | 9.74 | 196 |

2. with `--cpuset-cpus=4-15 --cpus=8`, with this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 1,534,869 | 101 | 8.31 | 196 |
   | 1,506,254 | 101 | 8.67 | 196 |
   | 1,525,930 | 100 | 8.68 | 197 |
   | 1,506,759 | 103 | 9.12 | 201 |
   | 1,541,576 | 102 | 8.94 | 197 |

We can observe a clear gain in RPS and a reduction in average latency.

-        cpu_count = quota / period;
+        cpu_count = (double) quota / period;
Member

        // Cannot have less than 1 CPU
        if (quota <= period)
        {
            *val = 1;
            return true;
        }

        cpu_count = (double) quota / period;
        if (cpu_count < UINT32_MAX)
        {
            // round up
            *val = (uint32_t)(cpu_count + 0.999999999);
        }
        else
        {
            *val = UINT32_MAX;
        }

maybe change this to:

*val = (UINT)(min((quota + period - 1)/period, UINT32_MAX))

if (PAL_GetCpuLimit(&cpuLimit) && cpuLimit < (uint32_t)processorCount)
    processorCount = cpuLimit;
#endif

END_QCALL;
Member

Do we want to remove this? Doesn't that change Environment.ProcessorCount to no longer take into account --cpus?

Author

Yes, that's the goal of the change.

src/vm/gcenv.os.cpp (review thread resolved)
@stephentoub
Member

I haven't followed the full discussion, but one thing to confirm:
Will the default min thread pool count still be >= Environment.ProcessorCount? For better or worse, there's a fair amount of code that depends on that.

@luhenry
Author

luhenry commented Apr 1, 2019

@stephentoub yes, and that's after discussion with @VSadov at https://github.com/dotnet/coreclr/issues/22302#issuecomment-477386311 and performance measurements that confirm it is the better approach.

@tmds we verified that the results are actually better with these changes (including the change to Environment.ProcessorCount). So is there any other reason not to get it in?

@stephentoub
Member

yes

Ok, good, thanks. It ends up being impactful in situations where a system queues Environment.ProcessorCount work items and expects that they will all be able to run concurrently (PLINQ does this for some queries, for example, where the work items it queues all need to join with each other at a barrier). While other work being executed can impact this, at least on an unloaded system if the min thread pool count >= ProcessorCount, the work items should all be able to start and run concurrently quickly. If the min thread pool count < ProcessorCount, then we immediately hit a starvation situation and it could take quite some time for the pool to ramp up to the required number of threads.

@luhenry
Author

luhenry commented Apr 1, 2019

The Environment.ProcessorCount work items queued to the ThreadPool will run concurrently, even though they will not be able to use 100% of each CPU they are running on; that's what --cpus does. The ThreadPool will then be able to take this --cpus value into account with https://github.com/dotnet/coreclr/issues/22302#issuecomment-477386311

@stephentoub
Member

even though they will not be able to use 100% of each CPU they are running on

That's fine.

@tmds
Member

tmds commented Apr 1, 2019

@tmds we verified that the resuts are actually better with these changes (including the change to Environment.ProcessorCount). So is there any other reason not to get it in?

It is not uncommon to run a lot of containers with a CPU quota of less than 1 on beefy servers with many cores. The benchmark results posted here don't include that scenario. I think we may be reducing performance in that case.

@luhenry
Author

luhenry commented Apr 3, 2019

@tmds following are results from running the ASP.NET Core benchmarks:

1. with `--cpuset-cpus=4-15 --cpus=0.5`, without this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 158940 | 53 | 31.77 | 505 |
   | 156841 | 52 | 31.18 | 720 |
   | 159030 | 52 | 31.19 | 509 |
   | 155849 | 52 | 31.55 | 507 |
   | 160123 | 52 | 31.17 | 408 |

2. with `--cpuset-cpus=4-15 --cpus=0.5`, with this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 154878 | 54 | 32.09 | 430 |
   | 163776 | 51 | 32.09 | 801 |
   | 162235 | 50 | 32.08 | 507 |
   | 148181 | 51 | 32.89 | 440 |
   | 157830 | 51 | 30.26 | 487 |

3. with `--cpuset-cpus=4-15 --cpus=1`, without this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 321827 | 106 | 15.13 | 279 |
   | 330651 | 104 | 11.13 | 200 |
   | 321547 | 104 | 14.95 | 321 |
   | 334414 | 104 | 10.56 | 262 |
   | 330111 | 104 | 13.32 | 241 |

4. with `--cpuset-cpus=4-15 --cpus=1`, with this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 320951 | 101 | 13.8 | 339 |
   | 323964 | 105 | 12.51 | 288 |
   | 312130 | 101 | 14.19 | 308 |
   | 323971 | 104 | 12.61 | 199 |
   | 330520 | 102 | 13.97 | 242 |

5. with `--cpuset-cpus=4-15 --cpus=2`, without this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 647812 | 104 | 9.91 | 242 |
   | 641200 | 102 | 8.36 | 200 |
   | 629749 | 104 | 11.54 | 234 |
   | 631857 | 101 | 9.09 | 200 |
   | 627933 | 101 | 11.19 | 294 |

6. with `--cpuset-cpus=4-15 --cpus=2`, with this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 646432 | 101 | 8.98 | 196 |
   | 637302 | 105 | 9.15 | 198 |
   | 654804 | 101 | 9.82 | 200 |
   | 635966 | 106 | 10.01 | 203 |
   | 634108 | 103 | 9.02 | 264 |

7. with `--cpuset-cpus=4-15 --cpus=4`, without this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 1157659 | 101 | 4.75 | 246 |
   | 1169138 | 103 | 4.8 | 196 |
   | 1184439 | 102 | 4.72 | 199 |
   | 1154286 | 101 | 5.65 | 211 |
   | 1128476 | 103 | 6.57 | 203 |

8. with `--cpuset-cpus=4-15 --cpus=4`, with this change

   | RPS | Max CPU (%) [1] | Avg. Latency | Startup |
   | --- | --- | --- | --- |
   | 1202938 | 105 | 4.25 | 199 |
   | 1139925 | 101 | 5.09 | 228 |
   | 1155460 | 101 | 4.84 | 200 |
   | 1125296 | 102 | 5.04 | 200 |
   | 1165013 | 102 | 5.27 | 215 |

We can see there is no decrease in performance with the changes, and even an increase in performance with https://github.com/dotnet/coreclr/issues/22302#issuecomment-477330281 and #23398 (comment).

@tmds
Member

tmds commented Apr 3, 2019

@luhenry thanks for running those benchmarks. Unfortunately, I don't feel confident about this change yet. It's not clear to me what now takes --cpus into account, what does not, and why. Also, I wonder if the benchmark results are a good match for the changes made here. For example, ProcessorCount changed from 1 to 12 in the last benchmark and the results barely change. Another thing is that the GC is also affected, but these ASP.NET benchmarks have been heavily optimized to cause very few GCs.

@@ -148,7 +148,8 @@ class SimpleRWLock
} CONTRACTL_END;

m_RWLock = 0;
m_spinCount = (GetCurrentProcessCpuCount() == 1) ? 0 : 4000;
// Passing false here reduces ASP.NET Core Plaintext benchmark results from 1.2M to 0.8M RPS.
Member

@jkotas jkotas Apr 4, 2019
Just curious - where (stacktrace) does the ASP.NET Core Plaintext take this lock?
