This repository has been archived by the owner on Jan 21, 2022. It is now read-only.

compute_cpu_usage should consider the cpu core number #93

Closed
libnux opened this issue Dec 20, 2013 · 11 comments

libnux commented Dec 20, 2013

In the current implementation, for apps deployed on a multi-core DEA/Warden, /stats can report CPU usage values larger than 1 (i.e., more than 100%).

For example, if a DEA node has a 2-core CPU, the largest possible reported usage is 2 (200%).

Since the CPU core count is unknown to end users, the reported value should be divided by the number of cores. In this example, the usage becomes 200% / 2 = 100%, which is more meaningful to end users than 200%.
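
A minimal sketch of the proposed normalization (the method name and core-count detection are hypothetical, not the actual dea_ng code):

```ruby
# Hypothetical sketch of the proposed change; the actual compute_cpu_usage
# in dea_ng does not divide by the core count.
def normalized_cpu_usage(raw_usage)
  # Count the cores visible on the DEA node (Linux-specific).
  cores = File.read('/proc/cpuinfo').scan(/^processor/).size
  raw_usage / cores.to_f # e.g. 2.0 (200%) on a 2-core node becomes 1.0 (100%)
end
```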


sykesm commented Dec 20, 2013

For what it's worth, I don't agree. What %cpu reports is the nanoseconds of CPU used by an app instance divided by the time elapsed since the last measurement. When you see more than 100%, you've consumed more CPU time than wall-clock time, an indication that you're using more than one instruction stream of processing power.

As for what you don't know when you push your app: you don't know whether all DEAs have the same CPU capacity or the same CPU count, or whether the virtual CPUs map directly to physical processors. The way CPU is reported today normalizes all of that away.
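
For concreteness, a worked example of the arithmetic described above, with hypothetical numbers:

```ruby
# Hypothetical numbers: an instance that keeps two cores busy for the
# whole 10-second measurement window reports 200%.
cpu_ns_used = 20_000_000_000          # 20 s of CPU time consumed
elapsed_ns  = 10_000_000_000          # 10 s of wall-clock time elapsed
usage = cpu_ns_used.to_f / elapsed_ns # => 2.0, i.e. 200%
```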


libnux commented Dec 20, 2013

If I want to take action based on the CPU usage reported by CF stats, how should I set the threshold when the usage can vary from 0% up to an undetermined maximum?

If my app is deployed on a DEA node with 1 CPU core, the maximum is 100%; with 8 cores, it could be 800%.

Perhaps CF stats could return the CPU core count. However, this is not implemented, and the value is always zero.


sykesm commented Dec 20, 2013

Honestly, if you're trying to take action based on CPU% thresholds, you're going to face a number of challenges. As I said, from the perspective of someone using the command line, you know nothing about the infrastructure. You don't know anything about the other applications that are deployed, so you have no idea why CPU% changes. While a change could be due to your app getting more or less traffic, it could also be because you were given more cycles when other (possibly unrelated) application instances on that DEA were stopped.

A PaaS is not an IaaS from a consumer perspective.


libnux commented Dec 20, 2013

I understand that it's hard to get the real CPU usage for an app running in a PaaS that uses lightweight containers.
My questions are:

  1. What's the purpose of the CPU% from the perspective of Cloud Foundry? Is it for information only?
  2. Is it possible for the DEA to report the number of CPU cores used by a Warden app? In the CF client Java API, there is a field named "cores" in the InstanceStats class; however, it is not implemented on the back end.

Thanks.

@caseymct

Hi @libnux @sykesm,

@libnux let's see if we can get to your base use case. What action would you take based on CPU thresholds? Could you use a latency metric instead, or an overall health metric that you write and provide for your apps?

@sykesm's point about a PaaS not being an IaaS is a good one. Besides not knowing what the host system is, we don't guarantee where you run or what size the host system is. You could potentially go from an AWS m1.medium to a c2.XL and you'd likely never know.

Also, we believe that the CPU percentage is a leftover from when we used to run apps as a single process on a monolithic kernel. @MarkKropf any comment, or any path forward for this kind of instrumentation?

@thansmann && @caseymct


libnux commented Dec 21, 2013

Hi @caseymct,

My point is that a CPU% larger than 100% is somewhat confusing to end users/operators. It could be normalized by dividing by the CPU core count, so that it falls within [0%, 100%], making it more meaningful and less confusing to end users.

Regarding the CPU cores, is there any planning about it?

I found the following comment in the lib/dea/protocol.rb file, so the cores field is not populated or returned for now. (As a result, in the CF client Java API, the "cores" property of the InstanceStats class is always zero.)

"

Purposefully omitted, as I'm not sure what purpose it serves.

cores

"

@MarkKropf

I tend to agree with @libnux when CPU% is presented to an app developer. If this value is displayed to the user, they want to know what percentage of their CPU capacity is being used. It's not really important to them whether that's 1 vCPU, 8 CPUs, or 20 ECUs.

As a CF operator I want to know what the real impact of an application is to the total CPU capacity of my machine/vm/instance/whatever. Seeing more granular data per core of the system may be beneficial, although there is already sufficient tooling in this space.


sykesm commented Jan 9, 2014

@MarkKropf, my point is that the user doesn't know what their capacity is, so giving them a percentage that ranges from 0-100% of the DEA's capacity isn't any more or less helpful than giving them a percentage normalized as compute time used over an interval. You could change how the statistic is presented, but it really isn't informative without additional context. Providing that additional context, however, complicates things even further.

In my opinion, this all boils down to whether people are used to the Irix mode of CPU reporting or the Solaris mode. (Irix mode is the default for top on Linux: a process using multiple cores can show more than 100%. Solaris mode divides that value by the number of CPUs.) In the end, though, it doesn't matter.

@MarkKropf

@sykesm Maybe I'm not giving the average user enough credit. The only knob a user has to affect their capacity is the # of instances. Given that, 0-100% gives good insight into what percentage of their capacity is being used. Knowing the 'real' CPU usage doesn't necessarily enable them to make decisions differently. Do you disagree?

@aristotelesneto

Apologies for sticking my nose in this, but I also got interested in how this is reported.

@MarkKropf, if that value reflects the DEA's capacity usage, then it's not necessarily 'their' capacity that's used, but the usage of the DEA as a whole. If the app owner is using a negligible amount of CPU time, yet the DEA is at full capacity due to a noisy neighbour, would cf stats still report 100%?

It would seem inappropriate, then, to expect the app owner to increase or decrease the # of instances based on that value. Speaking of noisy neighbours, how do the DEA and Warden handle them?

Correct me if I'm wrong, but I think this is what @sykesm was trying to refer to.

As for the original issue, though, I think it definitely makes sense to report 0-100 rather than 0-(cores × 100). Whether app owners should take action based on that value seems to be a different discussion altogether.


sykesm commented Jan 10, 2014

@wdneto, the CPU accounting subsystem for Linux cgroups reports the nanoseconds of CPU consumed by a container; Warden simply surfaces that to the DEA. The DEA polls that statistic (every 10 seconds by default), computes the difference from the last poll, and divides it by the elapsed time since the last poll. This is what gets surfaced as the CPU% consumed. When more nanoseconds of compute have been consumed than nanoseconds have elapsed, the CPU% exceeds 100%.
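
A minimal sketch of that polling loop (the cgroup path and names are illustrative, not the actual dea_ng code):

```ruby
# Illustrative sketch of the polling described above, not the real
# dea_ng implementation. cpuacct.usage holds the total CPU nanoseconds
# consumed by the container's cgroup since it was created.
USAGE_FILE = '/sys/fs/cgroup/cpuacct/my-instance/cpuacct.usage' # hypothetical path

def sample_cpu_percent(prev_usage_ns, prev_time)
  usage_ns   = File.read(USAGE_FILE).to_i
  now        = Time.now
  elapsed_ns = (now - prev_time) * 1_000_000_000
  pct = (usage_ns - prev_usage_ns) / elapsed_ns # > 1.0 means more than 100%
  [pct, usage_ns, now]
end
```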

For noisy neighbors, there's nothing in place today to really deal with them. The kernel attempts fair-share scheduling across all cgroups, and each app instance is essentially its own cgroup. A change was recently made to give staging preference over executing apps, but that's the extent of what exists today.

@MarkKropf, it is true that the only thing a user can influence is the number of instances. As the number of instances grows, you get more pieces of the pie, but each piece gets slightly smaller. Unfortunately, you don't know how many pieces there really are in the pie. Because of that, no matter what you report, CPU% for app instances isn't actionable as a scaling metric for end users.

Again, in the end you can report CPU% either way. I happen to believe that the existing mechanism works well; it's not "broken." The fact that it's consistent with Linux's default Irix-mode reporting is a bonus. ;)
