
Set memory and cpu requests and limit values for all containers #65

Merged
2 commits merged into master from qos-scaling on Sep 16, 2020

Conversation

@paulcwarren (Member) commented on Aug 28, 2020:

Relint is currently working on a scaling and Quality of Service (QoS) set of stories.

We are targeting 1.0 to be configured out-of-the-box as a "developer" edition aimed at those users who want to kick the tyres. As part of this, we would like to set limits on mem/cpu.

Since a "developer" edition may not be preferred by everyone, we want each component to be configurable to scale both horizontally (replicas) and vertically (mem/cpu). This will also allow users to deliver a Guaranteed QoS when required (although we are recommending that all of our pods and containers use the Burstable QoS). As part of this, we would like to ask you to do several things:

  1. consider which of your pods/containers you would like to expose scaling properties for.
  2. expose said configuration properties.
  3. set mem and CPU values for all containers in order to provide as much metadata to k8s as possible, so that its scheduler can do as good a job as possible. This PR is an initial attempt at setting these values (see the illustrative stanza below), although we know you are much more likely to have insight into your components' mem/cpu requirements than our guess.
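For reference, here is a minimal sketch of the kind of stanza this PR adds, using the request/limit values that appear in the reviewed hunk below; keeping requests below limits is what places a container in the Burstable QoS class. The CPU request here is a placeholder, not a value from this PR:

```yaml
# Minimal sketch of a Burstable-QoS resources stanza.
# Values mirror the hunk reviewed below except the CPU request,
# which is a placeholder; component teams should substitute their own numbers.
resources:
  requests:
    cpu: 300m       # placeholder: CPU the scheduler reserves for this container
    memory: 300Mi   # reserved memory used for scheduling decisions
  limits:
    cpu: 1000m      # container is throttled above this
    memory: 1.2Gi   # container is OOM-killed above this
```

Setting requests equal to limits for every container in a pod would instead give it the Guaranteed QoS class.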

If you have any questions or concerns, please let us know! Thanks!

#174462927

Co-Authored-By: Angela Chin <achin@pivotal.io>

@cf-gitbot (Collaborator) commented:

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/174559282

The labels on this GitHub issue will be updated when the story is started.

```yaml
    memory: 300Mi
  limits:
    cpu: 1000m
    memory: 1.2Gi
```
A member commented:

I know you didn't change this, but for @cloudfoundry/cf-capi how did we choose this memory limit?

On BOSH the limit is ~4GB:
https://github.com/cloudfoundry/capi-release/blob/6f8899976561e64eab8c9804b1ce772083bbf68c/jobs/cloud_controller_ng/spec#L878-L886

In production envs I've seen a well-used CF API settle in at around 2.5-3GB (occasionally bursting higher for certain requests), so I think we should make it match the capi-release limits.
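For comparison, a more forgiving stanza in line with that suggestion might look like the sketch below; the 4Gi figure is an assumption read off the linked BOSH spec, not a measured recommendation:

```yaml
# Sketch only: raise the memory ceiling toward the ~4GB BOSH capi-release
# default instead of 1.2Gi, while keeping the request low so the pod
# stays in the Burstable QoS class.
resources:
  requests:
    memory: 300Mi
  limits:
    cpu: 1000m
    memory: 4Gi   # assumption based on the capi-release spec linked above
```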

@cwlbraa (Contributor) commented on Aug 28, 2020:

It was probably a raw multiple of the initially-deployed memory usage. We set these to keep the API from getting OOM killed during test, not based on any experiment to produce memory bloat or "realistic" workloads.

The member replied:

That's good context, thanks @cwlbraa.

I wonder if we should just make them very forgiving for now (either meeting or exceeding the BOSH defaults). The main use for them is just to be a good neighbor and avoid starving other Pods of resources, but I feel like if we're targeting a "developer edition" 1.0 they might be more frustrating than helpful if they're too strict.

@cwlbraa (Contributor) commented on Aug 28, 2020:

thanks for this! the main change I'd make to what's here is:

  1. worker, deployment_updater, and clock should all be allowed 1000m CPU, and request 300m (see the sketch below). They're GIL'd and sometimes do compute-intensive tasks, but are mostly idle.
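A sketch of that suggestion applied to one of those containers; only the 300m request and 1000m limit come from the item above, and the memory figures are placeholders:

```yaml
# Illustrative resources for e.g. the worker container: request a modest
# 300m of CPU, but allow bursts up to a full core for its occasional
# compute-heavy tasks.
resources:
  requests:
    cpu: 300m
    memory: 300Mi   # placeholder, not a value from this thread
  limits:
    cpu: 1000m
    memory: 1Gi     # placeholder, not a value from this thread
```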

are y'all interested in adding the configurability to this PR? maybe to build some empathy for how cf-for-k8s component teams add optional config parameters to cf-for-k8s?

Things that should be configurable:

  1. cf-api-server instance count
  2. worker instance count
  3. clock memory limit (and request?)
  4. deployment updater memory limit (and request?)
  5. cf-api-controllers memory limit (and request?) AND CPU limit/request

Do we expect to have 4 scaling parameters exposed for each multithreaded, system-singleton, vertical-scale-only pod? It seems a bit much.
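For illustration only, knobs like the ones listed above might surface as ytt data values along the lines of the sketch below; the key names and defaults are hypothetical, not the actual cf-for-k8s or capi-k8s-release schema:

```yaml
#@data/values
---
capi:
  cf_api_server:
    replicas: 2              # horizontal scaling
  worker:
    replicas: 1              # horizontal scaling
  clock:
    memory_request: 300Mi    # vertical scaling for the singleton
    memory_limit: 1Gi
  deployment_updater:
    memory_request: 300Mi
    memory_limit: 1Gi
  cf_api_controllers:
    cpu_request: 100m
    cpu_limit: 500m
    memory_request: 100Mi
    memory_limit: 500Mi
```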

@njbennett (Contributor) commented on Aug 28, 2020:

I'm not sure if this is in-scope for this PR but if we're talking about automatic horizontal scaling...

One of the more dramatic failure modes I've seen in CF-for-VMs is the behavior of the platform when the shared internal database runs out of configured connections. If UAA can't get connections, then the whole control plane becomes inaccessible. Does Kubernetes have any means of taking this shared resource into account in its scaling decisions?

If not, enabling horizontal autoscaling seems like it creates an availability threat to the platform, even as it resolves another one.

@jamespollard8 (Contributor) commented:

> are y'all interested in adding the configurability to this PR? maybe to build some empathy for how cf-for-k8s component teams add optional config parameters to cf-for-k8s?

We've worked on a proposal for the Scaling Interface and it's under review now. So for now, we'd like to place configurability out of scope.

Commit: limits to 300m/1000m, as requested

[#174559282](https://www.pivotaltracker.com/story/show/174559282)

Co-authored-by: Eric Promislow <epromislow@suse.com>
@jamespollard8 (Contributor) commented:

> worker, deployment_updater, and clock should all be allowed 1000m CPU, and request 300m. They're GIL'd and sometimes do compute-intensive tasks, but are mostly idle.

Done ✅

@jamespollard8 (Contributor) commented:

> I'm not sure if this is in-scope for this PR but if we're talking about automatic horizontal scaling...
>
> One of the more dramatic failure modes I've seen in CF-for-VMs is the behavior of the platform when the shared internal database runs out of configured connections. If UAA can't get connections, then the whole control plane becomes inaccessible. Does Kubernetes have any means of taking this shared resource into account in its scaling decisions?
>
> If not, enabling horizontal autoscaling seems like it creates an availability threat to the platform, even as it resolves another one.

Great thoughts, Nat. They're out of scope for this minimal PR but I'd recommend dropping these thoughts in #cf-for-k8s slack channel to get a conversation going.

FWIW, our initial take is that this is something Kubernetes should help handle. We'd hope the Horizontal Pod Autoscaler would help here.
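For context, a minimal HorizontalPodAutoscaler for the API deployment might look like the sketch below (autoscaling/v2beta2 was the current API at the time; the target name and thresholds are assumptions). Note that the HPA only reacts to pod metrics such as CPU, so a shared database connection budget would still need to be respected separately, for example by capping maxReplicas:

```yaml
# Hypothetical HPA sketch; the deployment name and numbers are assumptions.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: cf-api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cf-api-server
  minReplicas: 2
  maxReplicas: 5            # cap chosen with the DB connection budget in mind
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```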

@cwlbraa cwlbraa merged commit 19ce138 into master Sep 16, 2020
jspawar pushed a commit to cloudfoundry/cf-for-k8s that referenced this pull request Sep 16, 2020
- Bumps to kpack 0.1.2 which has multiple breaking changes from the previous version we used
- Bumps to a branch of capi-k8s-release which supports the breaking changes kpack introduced
  - We (CAPI) will merge this into our primary branch after this change has been merged into cf-for-k8s
  - This also encapsulates additional changes which happened in the interim for:
    - cloudfoundry/capi-k8s-release#44
    - cloudfoundry/capi-k8s-release#65

Co-authored-by: Sannidhi Jalukar <sjalukar@pivotal.io>
@jamespollard8 jamespollard8 deleted the qos-scaling branch September 18, 2020 21:25